
Explainable label guided lightweight network with axial transformer encoder for early detection of oral cancer


Scientific Reports volume 15, Article number: 6391 (2025) Cite this article


Oral cavity cancer exhibits high morbidity and mortality rates, so it is essential to diagnose the disease at an early stage. Machine learning and convolutional neural networks (CNNs) are powerful tools for diagnosing mouth and oral cancer. In this study, we design a lightweight explainable network (LWENet) with label-guided attention (LGA) to provide a second opinion to the expert. The LWENet contains depth-wise separable convolution layers to reduce the computation cost. Moreover, the LGA module enforces label consistency among neighboring pixels and improves the spatial features. Furthermore, an AMSA (axial multi-head self-attention)-based ViT (vision transformer) encoder is incorporated into the model to provide global attention; it is computationally more efficient than the classical ViT encoder. We tested LWENet performance on the MOD (mouth and oral disease) and OCI (oral cancer image) datasets and compared the results with other CNN- and ViT-based methods. The LWENet achieved precision and F1-scores of 96.97% and 98.90% on the MOD dataset, and 99.48% and 98.23% on the OCI dataset, respectively. By incorporating Grad-CAM, we visualize the decision-making process, enhancing model interpretability. This work demonstrates the potential of LWENet with LGA in facilitating early oral cancer detection.

Oral cavity cancer is a global health concern, as its diagnosis is complex and it requires early-stage treatment. In the past decade, numerous cases of mouth and oral cancer affecting the tongue, lips, cheeks, floor of the mouth and pharynx have been reported. Cancer cells grow abnormally near the affected region1. The major causes include lifestyle, genetic and environmental factors, alcohol consumption and tobacco use. Typically, oral cancer needs immediate treatment to stop the disease from progressing to the next stage2. Common symptoms in the oral cavity are mouth ulcers, lumps in the mouth or throat, white patches (leukoplakia), and red patches (erythroplakia). A dentist or oral surgeon performs the initial clinical examination by inspecting the tissue for these signs. After that, they may conduct a biopsy, removing abnormal tissue. Other methods include pathological examination to determine the cancer cells and their stage3.

Recent advances in digital imaging have enabled medical practitioners to diagnose cancer using X-ray, CT-scan and MRI (Magnetic Resonance Imaging) images. Researchers use these images to develop automated systems for cancer diagnosis4. Machine learning and deep learning are widely used for the analysis of cancer images. In machine learning-based methods, the expert uses cancer cell features such as tissue shape, colour, and texture to train the algorithm. Convolutional neural networks (CNNs), in contrast, automatically extract features from the affected region. However, designing a robust CNN model is a complex task and requires a high volume of data for training and testing5.

In this study, we developed a lightweight, explainable CNN model with label-guided attention (LGA) for diagnosing oral cavity cancer. The LWENet has four depth-wise convolution layers that reduce the computation burden. In addition, an LGA module is added to the model, which improves label consistency and spatial features. Furthermore, an AMSA-based ViT encoder is incorporated into the model to provide global attention with fewer computation heads. We tested our model's performance on the MOD and OCI datasets. The LWENet results are superior to those of DenseNet, ShuffleNet, ResNeXt, MnasNet, and CellViT. In addition, a Grad-CAM module is incorporated to visualize the model's decision process.

The contributions of the manuscript are as follows.

We utilized depth-wise convolution layers to reduce the computation cost of diagnosing the disease and incorporated the LGA module into the model to improve label consistency and spatial features.

We design an AMSA-based ViT encoder to provide global attention to the cancer region, which is computationally efficient compared to the classical ViT encoder.

A hybrid loss function is designed to evaluate the model performance on the MOD and OCI datasets. In addition, we incorporated the Grad-CAM module in the model to visualize the model’s decision process.

The model’s performance is compared with six state-of-the-art CNN and ViT-based models on the MOD and OCI datasets.

The rest of the manuscript is organized as follows.

Mouth and oral cancer diagnosis using different methods is presented in Sect. 2, whereas the architecture of the LWENet is elaborated in Sect. 3. The dataset description and results are described in Sect. 4, and a detailed discussion of the method and a comparative study are presented in Sect. 5. Finally, the conclusion and future scope are given in Sect. 6.

Mouth cancer is a broader term than oral cancer. Oral cancer is restricted to the oral cavity, whereas mouth cancer includes the oral cavity and extends to the throat, soft palate, and tonsillar regions. Furthermore, oral cancer presents with red or white patches and non-healing ulcers, while mouth cancer has symptoms such as enlarged lymph nodes, difficulty swallowing, and changes in speech. Huang et al.6 introduced a CNN-based technique for oral cancer detection using Particle Swarm Optimization and Seagull Optimization algorithms. Advanced image processing was applied to the OCI dataset, and the performance of CNN, R-CNN, and ResNet-101 was compared, achieving the highest accuracy of 96.94%. In similar research, Raval et al.7 performed a comparative study of CNN models such as VGGNet, AlexNet, DenseNet, Inception, ResNet, and GNN (graph neural network) for detecting skin and oral cancer. In addition, data augmentation increases the training dataset to boost the robustness of the models, while image-processing techniques reduce noise and improve image quality.

Joseph et al.8 claim that the OralCDx brush biopsy is highly accurate for detecting precancerous and cancerous lesions. Regular examinations are recommended for treated patients to monitor secondary tumours. In addition, public awareness of cancer warning signs and of the risks of tobacco and alcohol is vital to reducing disease rates. Mira et al.9 developed a smartphone-based diagnosis method using deep learning, capturing high-quality mouth images with a centered rule and reducing variability through resampling. A dataset of five oral diseases was created, and a new deep-learning network was used for oral cancer diagnosis. This technique achieved 83.0% sensitivity, 96.6% specificity, 84.3% accuracy, and an 83.6% F1 score on 455 test images.

Myriam et al.10 propose an approach for detecting oral cancer using a convolutional neural network (CNN) and an optimized deep belief network (DBN), with design parameters enhanced by a hybrid Particle Swarm Optimization and Al-Biruni Earth Radius Optimization algorithm (PSOBER). This approach achieves promising results, outperforming various methods with an accuracy of 97.35%. Welikala et al.11 developed a comprehensive oral lesion library combining global expert annotations. They tested ResNet-101 for image classification and faster R-CNN for object detection in early cancer detection. Their method achieved F1 scores of 87.07% for identifying lesions and 78.30% for referral needs, while object detection scored 41.18% for lesions requiring referral.

Badawy et al.12 proposed a new framework for accurate oral cancer classification using CNNs optimized by Aquila and Gorilla Troops Optimizers. They tested pre-trained models with the Histopathologic Oral Cancer Detection dataset, including VGG19, MobileNet, and DenseNet201. The DenseNet201 model achieved the highest accuracy, with 99.25% using Aquila Optimizer and 97.27% with Gorilla Troops Optimizer, outperforming other models. Jubair et al.13 developed a lightweight deep CNN for classifying oral lesions as benign or malignant using real-time clinical images, leveraging a pre-trained EfficientNet-B0 for transfer learning. The model achieved 85.0% accuracy, 84.5% specificity, and 86.7% sensitivity.

Rahman et al.14 propose a transfer-learning model using AlexNet to extract features from oral squamous cell carcinoma (OSCC) biopsy images. Their model achieved 97.66% accuracy on training and 90.06% on testing, showing significant effectiveness. Omeroglu et al.15 propose a hybrid deep-learning model using soft attention techniques to classify skin lesions with multiple labels. The modified Xception architecture extracts features with a multi-branch structure, enhanced by a soft attention module. Their method achieved an average accuracy of 83.04% in multi-label skin lesion classification.

Wang et al.16 introduce a hybrid CNN-GRU (convolutional neural network and gated recurrent unit) model for detecting breast IDC (invasive ductal carcinoma) in whole slide images from the PCam Kaggle dataset. This approach addresses the challenges of manual detection by combining CNN and GRU layers to improve accuracy. Their model achieved 86.21% accuracy, 85.50% precision, 85.60% sensitivity, 84.71% specificity, an 88% F1-score, and a 0.89 AUC (area under the curve), addressing pathologist error and misclassification issues. Li et al.17 proposed a Raman spectroscopy system for rapid intraoperative detection of laryngeal carcinoma using PCA (principal component analysis), RF (random forest), and 1D CNN methods. They measured Raman spectra from 207 normal and 500 tumor sites from 10 surgical specimens. RF analysis yielded 90.5% accuracy, 88.2% sensitivity, and 92.8% specificity, while the 1D CNN achieved 96.1% accuracy, 95.2% sensitivity, and 96.9% specificity. Nasir et al.18 present a deep learning-based system for automatically detecting osteosarcoma using whole slide images (WSIs). Osteosarcoma, a malignant tumour affecting long bones in children, requires early detection to reduce mortality, but manual detection is expert-intensive and tedious. The proposed system automates image analysis for faster processing, and experiments on a large dataset of WSIs yielded up to 99.3% accuracy. Uthoff et al.19 proposed a dual-modality oral cancer-screening device for remote areas, using autofluorescence imaging (AFI) and white light imaging (WLI) on a smartphone. A custom Android app manages LED illumination and image capture. A remote specialist and a CNN classified 170 image pairs into suspicious and not suspicious with an accuracy of 94.94%.

Xu et al.20 proposed a 3D CNN-based algorithm for early oral cancer diagnosis and compared it with a 2D CNN. The 3D CNN extracts high-dimensional spatial features and achieves better performance than the 2D CNN, indicating that 3D CNNs could enhance CT-assisted diagnosis systems. Rajaguru et al.21 proposed the ABC-PSO algorithm and BLDA (Bayesian linear discriminant analysis) for classifying oral cancer risk levels, emphasizing the importance of early detection for improving survival rates. Their results showed that the hybrid ABC-PSO classifier achieved a classification accuracy of 100%, while the BLDA classifier achieved an accuracy of about 83.16%. This highlights the effectiveness of the hybrid ABC-PSO classifier in early and accurate detection, potentially leading to better patient outcomes despite the high mortality associated with late-stage detection.

Chan et al.22 proposed a deep convolutional neural network (DCNN) that uses texture maps to detect cancerous regions and mark the ROI. The DCNN has two branches: one for cancer detection and the other for semantic segmentation. The texture-map-based branch-collaborative network employs texture images and a sliding window to create a texture map. Experimental results showed average detection sensitivity and specificity of 0.9687 and 0.7129, respectively, using wavelet transform and 0.9314 and 0.9475 using the Gabor filter. Lin et al.23 proposed a centered rule image-capturing approach for oral cavity images, creating a dataset with five oral disease categories and reducing smartphone camera variability. Their HRNet achieved 83.0% sensitivity, 96.6% specificity, 84.3% precision, and an F1 score of 83.6% on 455 test images. Bakare et al.24 perform pre-processing and feature extraction for oral cancer. The pre-processing uses a median filter to reduce noise, the feature extraction module extracts temporal features, and the classification module uses SVM and KNN algorithms. Results showed the SVM classifier outperformed KNN, achieving 98% accuracy on 1224 histopathological images, compared to 83% with KNN.

In this study, we design a lightweight explainable sequential CNN with label-guided attention (LWENet) for mouth cancer diagnosis. Our model has four convolution layers, each followed by batch normalization (BN) and a ReLU (Rectified Linear Unit) activation function, as shown in Fig. 1. The first, second, third and fourth convolution blocks have 32, 64, 128 and 256 filters of size 3 × 3, respectively. In addition, after the third convolution block, we apply label-guided attention (LGA) to ensure label consistency. Furthermore, an AMSA-based ViT encoder is applied to the flattened features obtained from Conv4, providing global attention to the spatial features, and a softmax layer on top of the model is added to diagnose mouth cancer.

The architecture of the proposed model for mouth cancer diagnosis.

Let the input image \(I \in {{\mathbb{R}}^{H \times W \times D}}\), where, H, W, D are height, width and depth, be fed to the model. The Conv1 has 32 filters, each of size 3 × 3, followed by batch normalization (BN) and ReLU activation function. The spatial feature extracted from the Conv1 is defined as follows.

The extracted features are passed to Conv2 and Conv3, which perform convolution using 64 and 128 filters, each of size 3 × 3, followed by BN and ReLU activation as follows.
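To make the block structure concrete, the convolutional stem described above (four depth-wise separable 3 × 3 blocks with 32, 64, 128 and 256 filters, each followed by BN and ReLU) can be sketched in PyTorch. The framework choice and the depth-wise/point-wise decomposition details are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DWSeparableBlock(nn.Module):
    """Depth-wise separable 3x3 convolution followed by BN and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Four blocks with 32, 64, 128 and 256 filters, as described in the text.
stem = nn.Sequential(
    DWSeparableBlock(3, 32),     # Conv1
    DWSeparableBlock(32, 64),    # Conv2
    DWSeparableBlock(64, 128),   # Conv3 (LGA is applied to its output)
    DWSeparableBlock(128, 256),  # Conv4 (its features are flattened for the ViT encoder)
)

features = stem(torch.randn(1, 3, 256, 256))
print(features.shape)  # (1, 256, 256, 256); no pooling is described in the text, so spatial size is preserved
```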

Let the feature extracted from the intermediate convolution block be \(Z \in {{\mathbb{R}}^{H \times W \times D}}\). The linear projection of this feature to the intermediate space with the weight matrix is defined as follows.

Where Z = spatial feature map of the third convolution layer, \({Z^\prime } \in {{\mathbb{R}}^{H \times W \times {D^\prime }}}\) = linearly projected spatial feature and \({W_{proj}} \in {{\mathbb{R}}^{1 \times 1 \times D \times {D^\prime }}}\) = learnable weight matrix.

The distance between \({Z^\prime } \in {{\mathbb{R}}^{H \times W \times {D^\prime }}}\) and its neighbor for each pixel (i, j) is calculated as follows

Where \({N_{i,j}}\) = set of neighbor pixels, \({\left\| \cdot \right\|_2}\) = L2 norm, N = 8, and \(d \in {{\mathbb{R}}^{H \times W \times N}}\) = distance matrix. A multilayer perceptron (MLP) transforms the distance (d) using two linear projections with a Gaussian error linear unit (GeLU) as follows.

Where \({W_1} \in {{\mathbb{R}}^{1 \times 1 \times {K^\prime } \times N}}\) and \({W_2} \in {{\mathbb{R}}^{1 \times 1 \times N \times {K^\prime }}}\) are the weight matrices and \({K^\prime }\) = intermediate dimension. The feature affinity (FA) between neighbor pixels (i, j) is calculated as follows.

Where, \(\alpha \)=Learnable parameters. After that FA is normalized using softmax along the neighbor dimension as follows.

Where \( i \in \left\{ {1,2,3, \ldots N} \right\} \) and \(NF{A_{i,j}}\) = normalized feature affinity of pixel (i, j). Once the normalized FA is calculated, it is used to reconstruct the features using the projected features \({Z^\prime }\), a skip connection, ReLU activation and BN as follows.

where Fout = feature map obtained from the LGA and \({Z^{\prime\prime}_{i,j}}\) = reconstructed feature of pixel i and its neighbor j. The block diagram of the LGA is shown in Fig. 2. The feature (Z) extracted from the third convolution block is projected to Z′; the distance (d) between Z′ and its neighboring pixels is then calculated, and a multilayer perceptron (MLP) transforms d to d′ using two linear projections with a GeLU. Furthermore, the feature affinity (FA) between neighboring pixels (i, j) is calculated and normalized with softmax to NFA. Finally, the output features (Fout) are reconstructed from the projected features \({Z^\prime }\) with a skip connection, BN and ReLU activation.

Illustration of LGA block working process.

We applied LGA (label-guided attention) after Conv3, which performs a linear projection using a 2D convolution with kernel size 1 × 1, followed by the distance calculation between SF3 and its neighbors. Furthermore, the cancer region's local texture pattern and sparsity are captured by transforming the distance using an MLP (multi-layer perceptron). The pixel label inference after Conv3 using LGA is calculated using Eq. (10). After that, the enhanced spatial feature is passed to the Conv4 block with 256 filters, as shown in Eq. (11), and then to the second LGA block.
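A compact sketch of the LGA block described above is given below. It assumes PyTorch and fills in details the text leaves open (the projected width D′, the MLP hidden size, the sign convention that turns small distances into large affinities, and the projection back to the input width before the skip connection); it illustrates the idea rather than reproducing the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LGA(nn.Module):
    """Sketch of label-guided attention: project the feature map, measure L2
    distances from every pixel to its 8 neighbours, turn the distances into
    affinities with a small MLP and a learnable scale, normalise them with
    softmax over the neighbour dimension, and rebuild the features as an
    affinity-weighted sum of neighbours, with a skip connection, BN and ReLU."""

    def __init__(self, d_in, d_proj, hidden=16, n_neighbors=8):
        super().__init__()
        self.proj = nn.Conv2d(d_in, d_proj, 1, bias=False)            # W_proj (1x1 projection)
        self.mlp = nn.Sequential(                                     # two 1x1 projections with GELU
            nn.Conv2d(n_neighbors, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, n_neighbors, 1))
        self.alpha = nn.Parameter(torch.ones(1))                      # learnable affinity scale
        self.out_proj = nn.Conv2d(d_proj, d_in, 1, bias=False)        # back to input width for the skip
        self.bn = nn.BatchNorm2d(d_in)

    @staticmethod
    def _neighbours(x):
        """Stack the 8 one-pixel shifts of x: (B, C, H, W) -> (B, 8, C, H, W)."""
        pad = F.pad(x, (1, 1, 1, 1), mode="replicate")
        h, w = x.shape[2], x.shape[3]
        shifts = [pad[:, :, i:i + h, j:j + w]
                  for i in range(3) for j in range(3) if not (i == 1 and j == 1)]
        return torch.stack(shifts, dim=1)

    def forward(self, z):
        z_p = self.proj(z)                                 # Z'
        nb = self._neighbours(z_p)                         # neighbour features
        d = torch.norm(nb - z_p.unsqueeze(1), dim=2)       # L2 distance to each neighbour: (B, 8, H, W)
        fa = -self.alpha * self.mlp(d)                     # feature affinity (small distance -> large affinity)
        nfa = F.softmax(fa, dim=1)                         # normalise over the neighbour dimension
        z_rec = (nfa.unsqueeze(2) * nb).sum(dim=1)         # reconstructed features Z''
        return F.relu(self.bn(z + self.out_proj(z_rec)))   # skip connection, BN, ReLU

# Example on a Conv3-sized feature map (channel counts are illustrative).
out = LGA(d_in=128, d_proj=64)(torch.randn(2, 128, 64, 64))
print(out.shape)  # torch.Size([2, 128, 64, 64])
```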

The spatial feature SF5 is flattened and fed to the axial multi-head self-attention (AMSA) based ViT encoder.

The ViT is widely used in several domains of medical image diagnosis due to its capability of providing long-range dependency to the features. However, the computation cost of multi-head self-attention (MHSA) in the classical ViT is relatively high, which makes it a less suitable choice for real-time diagnosis systems. In the proposed study, we design an AMSA-based ViT encoder, which provides attention row-wise and column-wise, as shown in Fig. 3. Its major components are layer normalization, AMSA, and an MLP (multilayer perceptron). The flattened features obtained from SF5 are used to generate tokens for the ViT encoder. Let the output feature map from the convolution block be \(F \in {{\mathbb{R}}^{H \times B \times C}}\), where H, B, and C are the height, width and number of channels. We flatten the feature map row-wise and column-wise, considering each row and each column as a sequence of tokens, giving \(F{}_{{row}} \in {{\mathbb{R}}^{H \times B \times C}}\) and \({F_{col}} \in {{\mathbb{R}}^{B \times H \times C}}\). Let the total number of heads be h, each with dimension d/h, where d is the total feature dimension after projection. For each head, we independently calculate the query (Q), key (K) and value (V) in the row and column directions as follows.

Here, \(W_{Q}^{j},W_{K}^{j},W_{V}^{j} \in {{\mathbb{R}}^{C \times d/h}}\). After that, we calculate multi-head self-attention in each row independently for each head as follows.

Here, \(O{A_{row}}\) = overall attention from all heads in the row direction. Similarly, we calculate the multi-head self-attention in each column for each head and the overall attention in the column direction.

We fused the attention obtained from the row and column wise as follows.

The Lth layer of the proposed ViT encoder is summarized as follows.

Illustration of the AMSA based ViT encoder.
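A minimal sketch of one AMSA-based encoder layer (layer normalization, axial attention computed along rows and along columns, and an MLP, with residual connections) is shown below. PyTorch's nn.MultiheadAttention is used for convenience, and the additive fusion of the row-wise and column-wise attention is an assumption.

```python
import torch
import torch.nn as nn

class AxialMHSA(nn.Module):
    """Axial multi-head self-attention: attention is computed independently
    along every row and along every column of the (H, B) feature grid, and the
    two results are fused (here by addition, which is an assumption)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                  # x: (N, H, W, C)
        n, h, w, c = x.shape
        rows = x.reshape(n * h, w, c)                      # each row is a sequence of W tokens
        cols = x.permute(0, 2, 1, 3).reshape(n * w, h, c)  # each column is a sequence of H tokens
        a_row, _ = self.row_attn(rows, rows, rows)
        a_col, _ = self.col_attn(cols, cols, cols)
        a_row = a_row.reshape(n, h, w, c)
        a_col = a_col.reshape(n, w, h, c).permute(0, 2, 1, 3)
        return a_row + a_col                               # fuse row-wise and column-wise attention

class AMSAEncoderLayer(nn.Module):
    """One encoder layer: LayerNorm -> AMSA -> LayerNorm -> MLP, with residual connections."""

    def __init__(self, dim, heads=4, mlp_ratio=2):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.amsa = AxialMHSA(dim, heads)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                                  # x: (N, H, W, C)
        x = x + self.amsa(self.norm1(x))
        return x + self.mlp(self.norm2(x))

tokens = torch.randn(1, 16, 16, 256)            # a small H x B grid of 256-d tokens (sizes illustrative)
print(AMSAEncoderLayer(dim=256)(tokens).shape)  # torch.Size([1, 16, 16, 256])
```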

Subsequently, a softmax layer is applied on top of the model to convert the logits into their corresponding class probabilities.

where C = number of classes (we set C = 2 for the OCI dataset and C = 7 for the MOD dataset), Fout = feature vector extracted from the ViT encoder and Pout = probability of the class.

We designed the loss function using binary cross-entropy (BCE) and contrastive loss (CL) for the OCI dataset, whereas the loss on the MOD dataset is calculated using categorical cross-entropy (CCE) and contrastive loss. The BCE provides well-behaved gradients and maintains efficient optimization during training. Generally, the BCE loss calculates the average disparity between the predicted probabilities and the ground-truth labels over the batch. The CL simultaneously penalizes feature dissimilarity between pixel pairs with the same label and feature similarity between pairs with different labels. Mathematically, the BCE, CL and CCE are expressed as follows.

where N = total number of images in a batch (we set N = 32 in our experiments), \({x_i}\) = true label, \({y_i}\) = predicted probability of the ith sample, \({d_{i,j}}\) = distance between feature samples i and j, m = margin parameter, and M = number of classes (M = 7 for the MOD dataset). The loss calculation on the OCI dataset is represented as follows.

where \(\alpha \) and \(\beta \) are hyperparameters, both set to 0.1, and \({L_{OCI}}\) = loss on the OCI dataset. We define the loss function for the MOD dataset as follows.

where \(\gamma \) and \(\lambda \) are hyperparameters; we set both to 0.5 in our experiments. The computation complexity of the row-wise multi-head self-attention is \(O({H^2} \cdot d/h)\) and that of the column-wise attention is \(O({B^2} \cdot d/h)\), giving an overall complexity of \(O\left(h \cdot (B \cdot {H^2}+H \cdot {B^2}) \cdot d/h\right)=O\left((B \cdot {H^2}+H \cdot {B^2}) \cdot d\right)\), which is much smaller than that of the MHSA (multi-head self-attention) in the classical ViT model. The algorithm of the proposed method is as follows.
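The hybrid losses can be sketched as follows, using the stated weights (α = β = 0.1 for the OCI loss and γ = λ = 0.5 for the MOD loss). The particular margin-based pairwise form of the contrastive loss and the use of batch-level feature embeddings are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(features, labels, margin=1.0):
    """Margin-based pairwise contrastive loss (assumed form): pull same-label
    feature pairs together, push different-label pairs apart up to margin m."""
    d = torch.cdist(features, features)                          # pairwise distances d_ij
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    pos = same * d.pow(2)                                        # same label: penalise distance
    neg = (1.0 - same) * F.relu(margin - d).pow(2)               # different label: penalise closeness
    n = features.shape[0]
    return (pos + neg).sum() / (n * (n - 1))

def loss_oci(logits, targets, features, alpha=0.1, beta=0.1):
    """OCI (binary): weighted BCE plus contrastive loss, alpha = beta = 0.1."""
    bce = F.binary_cross_entropy_with_logits(logits.squeeze(-1), targets.float())
    return alpha * bce + beta * contrastive_loss(features, targets)

def loss_mod(logits, targets, features, gamma=0.5, lam=0.5):
    """MOD (7 classes): weighted CCE plus contrastive loss, gamma = lambda = 0.5."""
    cce = F.cross_entropy(logits, targets)
    return gamma * cce + lam * contrastive_loss(features, targets)

# Toy batch of 32, matching the batch size used in the experiments.
feats = torch.randn(32, 256)
print(loss_mod(torch.randn(32, 7), torch.randint(0, 7, (32,)), feats))
print(loss_oci(torch.randn(32, 1), torch.randint(0, 2, (32,)), feats))
```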

Oral cavity cancer diagnosis using LWENet.

In this section, we present a description of the dataset and the results of the MOD and OCI datasets.

The MOD (Mouth and Oral Disease) dataset consists of photos from dental facilities in Okara, Punjab, Pakistan, and other sources such as dentistry websites. It has 517 images divided into 7 disease classes: canker sores (CaS), cold sores (CoS), gingivostomatitis (Gum), mouth cancer (MC), oral cancer (OC), oral lichen planus (OLP), and oral thrush (OT). The images have a resolution of 256 × 256 × 3. Since the dataset is small, data augmentation techniques such as horizontal flip, vertical flip, rotation and shear were applied to increase its size. After augmentation, 5170 images were kept for training and validation25.

We utilized the OCI (oral cancer image) dataset to examine the proposed method. Shivam Barot and Prakrut Suthar collected the dataset from multiple Ear, Nose, and Throat (ENT) hospitals in Ahmedabad, India. The dataset has two categories, malignant and non-cancerous, with 87 and 47 images, respectively. The variety of instances enables thorough validation of many types of diagnostic equipment. These images were carefully categorized with the aid of ENT experts to guarantee precise labelling. The images have varying resolutions and are stored in .jpeg format. Researchers and practitioners can access the database through the Kaggle website26. We applied data augmentation to each class separately to increase the dataset size and saved the images in their corresponding classes. After that, a 5-fold cross-validation technique is applied to avoid biased performance estimates. Figure 4 shows sample images of the MOD and OCI datasets.

Top row: sample images of the OCI dataset; bottom row: sample images of the MOD dataset.

A summary of the original and augmented images in each class of the MOD and OCI datasets is given in Table 1. We applied 10-fold and 20-fold augmentation to each class of the MOD and OCI datasets, respectively. After that, a 5-fold cross-validation scheme was applied to evaluate model performance on both datasets.
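The augmentation and evaluation protocol, class-wise augmentation (10× for MOD, 20× for OCI) followed by 5-fold cross-validation, can be sketched as follows; torchvision and scikit-learn are assumed, and the rotation and shear ranges are illustrative rather than the values used by the authors.

```python
import numpy as np
from sklearn.model_selection import KFold
from torchvision import transforms

# Augmentations named in the text: horizontal flip, vertical flip, rotation and shear.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomAffine(degrees=20, shear=10),  # rotation/shear ranges are assumptions
    transforms.Resize((256, 256)),
])

def augment_class(images, factor):
    """Expand one class `factor`-fold (10x for MOD, 20x for OCI in the paper)."""
    out = list(images)
    for img in images:
        out.extend(augment(img) for _ in range(factor - 1))
    return out

# 5-fold cross-validation over the augmented set: four folds train, one validates.
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(np.arange(5170)))
train_idx, val_idx = folds[0]
print(len(train_idx), len(val_idx))  # 4136 1034 for the augmented MOD dataset
```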

We conducted experiments on the MOD and OCI datasets using scripts written in Python 3.10 on an NVIDIA Quadro RTX 4000 GPU. The machine runs the Windows 10 operating system with 128 GB of RAM. Each experiment is conducted for 50 epochs using the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 32.
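A minimal training-loop sketch matching this configuration (Adam, initial learning rate 0.0001, batch size 32, 50 epochs) is shown below; the model and data are placeholders standing in for LWENet and one cross-validation fold.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder fold: random tensors standing in for one training split of MOD (7 classes).
train_ds = TensorDataset(torch.randn(64, 3, 256, 256), torch.randint(0, 7, (64,)))
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)         # batch size 32

model = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
                            torch.nn.Linear(3, 7))                       # stand-in for LWENet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                # Adam, initial LR 0.0001

for epoch in range(50):                                                  # 50 epochs
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
```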

The MOD dataset contains an unequal number of images across its 7 classes, with a resolution of 256 × 256 × 3. We applied a 5-fold cross-validation technique to diagnose mouth and oral cancer diseases: the MOD dataset is divided into five equal parts, of which four are used for training and one for validation. The LWENet is trained for 50 epochs using the Adam optimizer with an initial learning rate of 0.0001. After training, confusion matrices for each fold are plotted, as shown in Fig. 5. Fold 1 has 33 FP (false positive) and 25 FN (false negative) values, whereas fold 2 shows 24 FP and 19 FN values. In the subsequent folds, the FP and FN values show a decreasing trend, and fold 5 has 4 FP and 7 FN values.

The confusion matrices on the MOD dataset: (a) fold 1, (b) fold 2, (c) fold 3, (d) fold 4 and (e) fold 5.

The performance indicators precision, recall, F1-score, accuracy, and kappa for each fold are calculated using the mathematical descriptions stated in27 and are shown in Table 2. Fold 1 has precision and kappa values of 94.26% and 93.40%, respectively, whereas fold 2 shows improvement in the F1-score and recall values. Furthermore, in fold 3 the model achieved 97.07% precision and 97.96% classification accuracy. The average precision and kappa values on the MOD dataset are 94.09% and 96.30%, respectively.
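The per-fold indicators can be computed from the validation predictions, for example with scikit-learn; the library choice and the macro averaging over the 7 MOD classes are assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

def fold_metrics(y_true, y_pred):
    """Per-fold indicators reported in the tables (macro-averaged for 7-class MOD)."""
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "accuracy": accuracy_score(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

# Placeholder predictions for one validation fold of the augmented MOD set.
rng = np.random.default_rng(0)
print(fold_metrics(rng.integers(0, 7, 1034), rng.integers(0, 7, 1034)))
```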

The OCI dataset has fewer images, which may cause overfitting. Therefore, we applied data augmentation techniques such as horizontal flip, vertical flip, rotation, and shear to increase the dataset size 20 times. After augmentation, we have 1740 images in the cancer class and 880 images in the non-cancer class. Images in the dataset have varying resolutions, so we resized them to 256 × 256 × 3 and fed them to the model for training using a 5-fold cross-validation scheme. The LWENet is trained for 50 epochs using the Adam optimizer with an initial learning rate of 0.0001. After training, confusion matrices for each fold are plotted, as shown in Fig. 6. Fold 1 has 7 FP and 8 FN values, whereas fold 2 shows 5 FP and 7 FN values. In the subsequent folds, the FP and FN values show a decreasing trend, and fold 5 has 0 FP and 2 FN values.

The confusion matrices on the OCI dataset: (a) fold 1, (b) fold 2, (c) fold 3, (d) fold 4 and (e) fold 5.

The performance indicators precision, recall, F1-score, accuracy, and kappa for each fold are calculated and presented in Table 3. Fold 1 has precision and kappa values of 96.72% and 93.60%, respectively, whereas fold 2 shows improvement in the F1-score and recall values. Furthermore, in fold 3 the model achieved 98.29% precision and 98.66% classification accuracy. The average precision and kappa values on the OCI dataset are 96.22% and 96.58%, respectively.

Recent advances in deep learning have shown tremendous use in medical imaging to diagnose pneumonia, skin cancer, and breast cancer. In addition, deep learning is widely used in treating heart disease, lung cancer, kidney disease, and Parkinson's disease. The scarcity of annotated datasets is addressed by GAN-based models and federated learning. Moreover, deep learning models help experts make precise diagnoses in critical surgery.

Mouth and oral cancer are dangerous diseases that need attention at an early stage. Machine learning has improved cancer diagnosis performance; however, the handcrafted features used in machine learning techniques are expert-dependent28. Recently, CNN-based methods improved the performance of mouth and oral cancer diagnosis by extracting high-dimensional spatial features. However, classical CNNs ignore small and boundary regions in several applications29. ViT (vision transformer) and attention-based models that handle these challenges have been tested in medical domains and achieved satisfactory results30,31,32. In the proposed study, we design a lightweight, explainable sequential CNN with label-guided attention (LWENet) for mouth and oral cancer diagnosis. The LWENet has four depth-wise separable convolution layers, each followed by BN and ReLU activation. After the third convolution layer, LGA is added to the model to improve labelling consistency, which makes the model focus more on useful spatial features in different cancer regions. Furthermore, a transformer encoder with AMSA is utilized to provide global attention. In addition, the computational cost of the proposed ViT encoder is lower than that of the classical ViT encoder.

We evaluated model performance on the MOD and OCI datasets and achieved satisfactory results with lower computation costs. Further, the LWENet performance is compared with DenseNet33, ShuffleNet34, ResNeXt35, MnasNet36 and CellViT37. The DenseNet, ShuffleNet, ResNeXt and MnasNet are CNN-based models. On the other hand, the CellViT is a vision transformer-based model. The DenseNet is 121 layers deep and divided into 4 dense blocks of 6, 12, 24 and 16 layers. On the other hand, ShuffleNet has 50 layers with depth-wise convolution.

Moreover, ResNeXt is a residual network with 50 layers. The MnasNet has 53 layers with 5 stages of MBConv layers, and some layers contain Squeeze-and-Excitation (SE) blocks. Meanwhile, CellViT contains a transformer block as an encoder to enhance the correlation of spatial image features. Each model is evaluated under the same experimental settings for a fair comparison. After training, the performance measures are reported in Tables 4 and 5 for the MOD and OCI datasets.

Table 4 shows that ShuffleNet achieved the lowest precision and accuracy of 80.24% and 83.45%, respectively. The DenseNet has improved F1-score and Kappa values of 82.75% and 84.23%, respectively. Meanwhile, ResNeXt and MnasNet’s performance measures are more than 85%. At the same time, CellViT has better precision and recall value than classical CNN-based methods. Moreover, the proposed LWENet achieved the highest kappa and accuracy of 96.30% and 97.86%, respectively.

Table 5 shows that DenseNet achieved the lowest precision and accuracy of 84.06% and 86.42%, respectively. The ShuffleNet has improved F1-score and Kappa values of 87.89% and 90.03%, respectively. Meanwhile, ResNeXt and MnasNet have Kappa values of more than 91%. At the same time, CellViT has better precision and recall values than the classical CNN-based methods. Moreover, the proposed LWENet achieved the highest Kappa and accuracy of 96.58% and 98.99%, respectively.

Visualizing the training and validation process in each epoch is essential for analyzing the results. Figure 7 depicts the training and validation loss as well as the accuracy of the model on the MOD and OCI datasets. In Fig. 7a, the initial training and validation losses on the MOD dataset are close to 0.5 and 1.6, respectively; after 10 epochs they start decreasing and fall below 0.2 at 30 epochs. On the OCI dataset, the initial training and validation losses are close to 0.5; after 5 epochs they start decreasing and fall below 0.1 at 10 epochs. Moreover, the training and validation accuracy on the MOD dataset reaches more than 95% after 10 epochs and more than 97% after 20 epochs, while on the OCI dataset it reaches more than 90% after 10 epochs.

The training and validation loss and accuracy on MOD and OCI datasets are shown in (a–d) respectively.

The ROC (receiver operating characteristic) plots of LWENet, DenseNet, ShuffleNet, ResNeXt, MnasNet and CellViT are shown in Fig. 8. The ROC curve describes the tradeoff between TPR (true positive rate) and FPR (false positive rate) at different threshold values. In Fig. 8a, the AUC (area under the curve) values of DenseNet and ShuffleNet are 0.8827 and 0.8567, respectively. Meanwhile, ResNeXt, MnasNet, and CellViT have AUC values of more than 0.9, and the proposed LWENet achieved a 0.9897 AUC value on the MOD dataset. On the other hand, the AUC values of DenseNet, ShuffleNet and ResNeXt are more than 0.9 on the OCI dataset, MnasNet and CellViT achieved more than 0.95, and LWENet obtained an AUC value close to 1. The high AUC values of the LWENet on the MOD and OCI datasets confirm its effectiveness for diagnosing mouth and oral cancer.

ROC-based comparison of the LWENet with other methods on (a) MOD and (b) OCI datasets.
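An ROC comparison of this kind can be reproduced from each model's predicted scores on the binary OCI task with scikit-learn and matplotlib (assumed tooling); the scores below are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Placeholder ground truth and per-model scores on the binary OCI validation set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
model_scores = {"LWENet": rng.random(500), "DenseNet": rng.random(500)}

plt.figure()
for name, scores in model_scores.items():
    fpr, tpr, _ = roc_curve(y_true, scores)          # TPR vs. FPR at every threshold
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.4f})")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc_oci.png")
```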

In this section, we present the effect of the model's different components on its performance on the MOD and OCI datasets. Table 6 summarizes the precision and accuracy on the MOD and OCI datasets for different components of the model. The base CNN model achieved a precision and accuracy of 91.28% and 93.14%, respectively, and including the LGA block in the CNN model improved the performance by over 2%. Furthermore, CNN + LGA + ViT (MHSA) obtained precisions of 93.15% and 95.14% on the MOD and OCI datasets, respectively. Moreover, our CNN + LGA + ViT (AMSA) improved precision and accuracy by more than 2% on both datasets.

The hyperparameters, such as batch size (BS) and learning rate (LR), play crucial roles in model training and generalization. Figure 9 shows bar plots of the effect of BS and LR on the MOD and OCI datasets. In Fig. 9a, an LR of 0.0001 and a BS of 32 give the highest classification accuracy, while increasing the BS or the LR slightly reduces performance. Similarly, on the OCI dataset, a BS of 32 and an LR of 0.0001 achieved the highest classification accuracy of 98.99%. When the BS was increased to 64 and the LR was set to 0.00001, the model achieved 96.65% classification accuracy.

Effect of BS and LR on (a) MOD and (b) OCI datasets.

The training and validation times of LWENet, DenseNet, ShuffleNet, ResNeXt, MnasNet and CellViT on the MOD and OCI datasets are shown in Fig. 10. The training time of CellViT on the MOD dataset is the highest, since the multi-head self-attention mechanism requires many operations to compute attention. Meanwhile, MnasNet completed training in 461 minutes (min), and ShuffleNet and ResNeXt completed training in 387 and 288.4 min, respectively. DenseNet requires the second-lowest training time, and the proposed LWENet the lowest on both the MOD and OCI datasets. Furthermore, the validation time of CellViT is high on both datasets, while ResNeXt and MnasNet have comparable validation times. The LWENet requires the second-lowest validation time on the OCI dataset, while DenseNet has the lowest.

Training and validation time on MOD and OCI datasets.

In medical imaging, interpretation of the diagnosis process is essential to get in-depth insight into the disease38. In addition, an explanation of the results helps the expert understand the cancer region in the image. Figure 11 shows the Grad-CAM results of our model: the left column shows the original image, the middle column the Grad-CAM of the base CNN, and the right column the Grad-CAM of the base CNN with LGA. We can observe that the base CNN's Grad-CAM deviates slightly from the cancer region, whereas CNN + LGA shifts the Grad-CAM result closer to the mouth and cancer region.

Interpretation of the model decision for mouth and oral cancer diagnosis.
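A hook-based Grad-CAM sketch for a convolutional layer is given below. It follows the standard Grad-CAM recipe (gradient-weighted activations, ReLU, upsampling to the input size) and is an illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM: weight the target layer's activations by the spatially
    averaged gradients of the class score, keep the positive part and upsample."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)              # global-average-pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))     # weighted activations, positive part only
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()             # heat map normalised to [0, 1]

# Usage (names are placeholders): heatmap = grad_cam(lwenet, lwenet.conv4, image, class_idx=pred)
```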

We compared LWENet performance with DenseNet, ShuffleNet, ResNeXt, MnasNet and CellViT, which are state-of-the-art CNN- and ViT-based models. The LWENet is also compared with ConvNeXt39, Swin-T40, MobileViT41, and MobileNetV342 under the experimental conditions discussed in Sect. 4.2. The results on the MOD and OCI datasets are shown in Tables 7 and 8, respectively. MobileNetV3's performance is the lowest, while MobileViT has precisions of 87.18% and 90.73% on the MOD and OCI datasets, respectively. Furthermore, Swin-T obtained kappa scores of 90.60% and 93.04% on the MOD and OCI datasets. Moreover, the LWENet achieved higher performance measures than the other methods.

Deep learning and ViT-based methods require extensive data for training, and training models on small datasets may lead to overfitting. Therefore, we applied data augmentation to increase the size of the dataset and evaluated model performance. Furthermore, to assess generalization, we trained LWENet on the dataset proposed in the study43. This dataset contains 3000 images stored in JPEG format and categorized into four classes: OCA (prognosis of oral cancer), OPMD (oral potentially malignant disorders), benign and healthy. The OCA, OPMD, benign and healthy classes contain 129, 1394, 748 and 729 images, respectively. Sample images of the dataset are shown in Fig. 12.

Sample images of the dataset43.

We trained LWENet on the above-mentioned dataset and tested it on the OCI dataset under the same experimental settings discussed in Sect. 4.2. The quantitative results are presented in Table 9. Table 9 shows that our model achieved precision and Kappa values of 94.16% and 93.78%, respectively. Moreover, the LWENet obtained a classification accuracy of 95.37%.

Table 10 shows the effect of CL on the MOD and OCI datasets. On the MOD dataset, CL improved the performance measures at a higher rate in the classes with fewer samples, while the improvement in precision is smaller in the classes with more samples. Furthermore, on the OCI dataset, the increase in precision for the malignant class is lower than that for the healthy class.

We present the complexity analysis in terms of the number of FLOPs and parameters of MobileNetV3, MobileViT, Swin-T, ConvNeXt and LWENet in Table 11. Table 11 shows that MobileNetV3 has the fewest parameters and FLOPs, while MobileViT has 10.6 M parameters and 3.9 GFLOPs. Furthermore, the Swin-T and ConvNeXt base models have 88 M and 50 M parameters, respectively. Moreover, our LWENet has 12.8 M parameters, slightly more than MobileViT. However, the performance of the LWENet on the MOD and OCI datasets is better than that of MobileNetV3 and MobileViT.
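Parameter counts such as those in Table 11 can be read directly from the model, and FLOPs with a profiler; the snippet below is a generic sketch with a stand-in model.

```python
import torch

def count_parameters(model):
    """Trainable parameter count, e.g. the ~12.8 M reported for LWENet in Table 11."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-in model; replace with the actual network to reproduce the table.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 32, 3, padding=1),
                            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
                            torch.nn.Linear(32, 7))
print(count_parameters(model), "trainable parameters")

# FLOPs can be measured with a profiler such as fvcore (usage assumed):
# from fvcore.nn import FlopCountAnalysis
# print(FlopCountAnalysis(model, torch.randn(1, 3, 256, 256)).total())
```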

The morphological characteristics, such as shape and texture, are very similar across many classes, which makes the task more challenging. In addition, annotated datasets are limited. The LWENet extracts high-dimensional spatial features from images of mouth and oral cancer. Furthermore, the LGA blocks help the model focus on edge and boundary regions. Moreover, the transformer encoder with AMSA enriches the model with global contextual information. The proposed model weights can be incorporated into other applications, including mobile devices and cloud-based systems, to provide a second opinion to the oncologist.

Mouth and oral cancer need attention at an early stage so that patients' lives can be saved. The use of machine learning and deep learning methods to develop automated systems for cancer diagnosis has grown recently. However, a robust and interpretable system is needed for rapid and in-depth analysis of the disease. We proposed a lightweight CNN with LGA for cancer diagnosis. The LGA block improves the label consistency of neighboring pixels in the image, which makes the model focus on the relevant region and improves the spatial feature vector for cancer classification. Furthermore, the AMSA-based ViT encoder enhances the model's overall performance with less computation burden. The effectiveness of the LWENet is tested on the MOD and OCI cancer datasets. The model achieved a precision and accuracy of 95.71% and 97.86% on the MOD dataset, respectively, whereas the precision and accuracy on the OCI dataset are 96.98% and 98.99%, respectively. In addition, the model exhibits higher Kappa and recall values compared to other CNN- and ViT-based methods. Furthermore, the integration of Grad-CAM provides insight into the model's decision process, which can be used to offer a second opinion to doctors. The computation cost of the LGA block in the LWENet can be reduced further for real-time applications. In addition, the model needs to be tested on real-time clinical data so that its effectiveness can be further evaluated. Future studies on mouth and oral cancer diagnosis can explore more diverse datasets. In addition, multiscale and nature-inspired algorithms can be used for better results.

Data Availability Statement: The data of the present study can be downloaded from the URL: https://www.kaggle.com/datasets/javedrashid/mouth-and-oral-diseases-mod, https://www.kaggle.com/datasets/shivam17299/oral-cancer-lips-and-tongue-images.

1. Chamoli, A. et al. Overview of oral cavity squamous cell carcinoma: risk factors, mechanisms, and diagnostics. Oral Oncol. 121, 105451 (2021).
2. Ghufran, M. S., Soni, P. & Duddukuri, G. R. The global concern for cancer emergence and its prevention: a systematic unveiling of the present scenario. In Bioprospecting of Tropical Medicinal Plants 1429–1455 (Springer Nature Switzerland, Cham, 2023).
3. Natarajan, E. Benign lumps and bumps. In Dental Science for the Medical Professional: An Evidence-Based Approach 163–199 (Springer International Publishing, Cham, 2023).
4. Koundal, D. & Sharma, B. Challenges and future directions in neutrosophic set-based medical image analysis. In Neutrosophic Set in Medical Image Analysis 313–343 (Academic Press, 2019).
5. Kaushal, C., Bhat, S., Koundal, D. & Singla, A. Recent trends in computer assisted diagnosis (CAD) system for breast cancer diagnosis using histopathological images. IRBM 40(4), 211–227 (2019).
6. Huang, Q., Ding, H. & Razmjooy, N. Oral cancer detection using convolutional neural network optimized by combined seagull optimization algorithm. Biomed. Signal Process. Control 87, 105546 (2024).
7. Raval, D. & Undavia, J. N. A comprehensive assessment of convolutional neural networks for skin and oral cancer detection using medical images. Healthc. Anal. 3, 100199 (2023).
8. Joseph, B. K. Oral cancer: prevention and detection. Med. Principles Pract. 11(Suppl. 1), 32–35 (2002).
9. Mira, E. S. et al. Early diagnosis of oral cancer using image processing and artificial intelligence. Fusion: Pract. Appl. 14(1), 293–308 (2024).
10. Myriam, H. et al. Advanced meta-heuristic algorithm based on particle swarm and Al-Biruni Earth Radius optimization methods for oral cancer detection. IEEE Access 11, 23681–23700 (2023).
11. Welikala, R. A. et al. Automated detection and classification of oral lesions using deep learning for early detection of oral cancer. IEEE Access 8, 132677–132693 (2020).
12. Badawy, M., Balaha, H. M., Maklad, A. S., Almars, A. M. & Elhosseini, M. A. Revolutionizing oral cancer detection: an approach using Aquila and Gorilla algorithms optimized transfer learning-based CNNs. Biomimetics 8(6), 499 (2023).
13. Jubair, F. et al. A novel lightweight deep convolutional neural network for early detection of oral cancer. Oral Dis. 28(4), 1123–1130 (2022).
14. Rahman, A. U. et al. Histopathologic oral cancer prediction using oral squamous cell carcinoma biopsy empowered with transfer learning. Sensors 22(10), 3833 (2022).
15. Omeroglu, A. N., Mohammed, H. M., Oral, E. A. & Aydin, S. A novel soft attention-based multi-modal deep learning framework for multi-label skin lesion classification. Eng. Appl. Artif. Intell. 120, 105897 (2023).
16. Wang, X. et al. Intelligent hybrid deep learning model for breast cancer detection. Electronics 11(17), 2767 (2022).
17. Li, Z. et al. Machine-learning-assisted spontaneous Raman spectroscopy classification and feature extraction for the diagnosis of human laryngeal cancer. Comput. Biol. Med. 146, 105617 (2022).
18. Nasir, M. U. et al. IoMT-based osteosarcoma cancer detection in histopathology images using transfer learning empowered with blockchain, fog computing, and edge computing. Sensors 22(14), 5444 (2022).
19. Uthoff, R. D. et al. Point-of-care, smartphone-based, dual-modality, dual-view, oral cancer screening device with neural network classification for low-resource communities. PLoS ONE 13(12), e0207493 (2018).
20. Xu, S. et al. An early diagnosis of oral cancer based on three-dimensional convolutional neural networks. IEEE Access 7, 158603–158611 (2019).
21. Rajaguru, H. & Prabhakar, S. K. Oral cancer classification from hybrid ABC-PSO and Bayesian LDA. In 2017 2nd International Conference on Communication and Electronics Systems (ICCES) 230–233 (IEEE, 2017).
22. Chan, C. H. et al. Texture-map-based branch-collaborative network for oral cancer detection. IEEE Trans. Biomed. Circuits Syst. 13(4), 766–780 (2019).
23. Lin, H., Chen, H., Weng, L., Shao, J. & Lin, J. Automatic detection of oral cancer in smartphone-based images using deep learning for early diagnosis. J. Biomed. Opt. 26(8), 086007 (2021).
24. Bakare, Y. B. Histopathological image analysis for oral cancer classification by support vector machine. Int. J. Adv. Signal. Image Sci. 7(2), 1–10 (2021).
25. Rashid, J. et al. Mouth and oral disease classification using InceptionResNetV2 method. Multimedia Tools Appl. 83(11), 33903–33921 (2024).
26. Barot, S. & Suthar, P. Oral Cancer (Lips and Tongue) Images (2020). https://www.kaggle.com/shivam17299/oral-cancer-lips-and-tongue-images
27. Yadav, D. P., Chauhan, S., Kada, B. & Kumar, A. Spatial attention-based dual stream transformer for concrete defect identification. Measurement 218, 113137 (2023).
28. Gupta, N., Garg, H. & Agarwal, R. A robust framework for glaucoma detection using CLAHE and EfficientNet. The Visual Computer 1–14 (2022).
29. Zhao, X. et al. A review of convolutional neural networks in computer vision. Artif. Intell. Rev. 57(4), 99 (2024).
30. Pacal, I. MaxCerVixT: a novel lightweight vision transformer-based approach for precise cervical cancer detection. Knowl. Based Syst. 289, 111482 (2024).
31. Maman, A., Pacal, I. & Bati, F. Can deep learning effectively diagnose cardiac amyloidosis with 99mTc-PYP scintigraphy? J. Radioanal. Nucl. Chem. 1–16 (2024).
32. Pacal, I. A novel Swin transformer approach utilizing residual multi-layer perceptron for diagnosing brain tumors in MRI images. Int. J. Mach. Learn. Cybern. 1–19 (2024).
33. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4700–4708 (2017).
34. Zhang, X., Zhou, X., Lin, M. & Sun, J. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6848–6856 (2018).
35. Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1492–1500 (2017).
36. Tan, M. et al. MnasNet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2820–2828 (2019).
37. Hörst, F. et al. CellViT: vision transformers for precise cell segmentation and classification. Med. Image Anal. 94, 103143 (2024).
38. Pacal, I., Celik, O., Bayram, B. & Cunha, A. Enhancing EfficientNetv2 with global and efficient channel attention mechanisms for accurate MRI-based brain tumor classification. Cluster Comput. 1–26 (2024).
39. Liu, Z. et al. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11976–11986 (2022).
40. Liu, Z. et al. Swin Transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision 10012–10022 (2021).
41. Mehta, S. & Rastegari, M. MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178 (2021).
42. Howard, A. et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision 1314–1324 (2019).
43. Piyarathne, N. S. et al. A comprehensive dataset of annotated oral cavity images for diagnosis of oral cancer and oral potentially malignant disorders. Oral Oncol. 156, 106946 (2024).


This research received no external funding.

Department of Computer Engineering & Applications, GLA University Mathura, Mathura, India

Dhirendra Prasad Yadav

Centre for Research Impact & Outcome, Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, 140401, India

Bhisham Sharma

Department of Computer Science and Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India

Ajit Noonia

Department of Electronics and Communication Engineering, Kuwait College of Science and Technology (KCST), Doha Area, 7th Ring Road, Kuwait City, Kuwait

Abolfazl Mehbodniya


Conceptualization, Dhirendra Prasad Yadav, Bhisham Sharma, Ajit; Data curation, Dhirendra Prasad Yadav, Bhisham Sharma, Ajit; Formal analysis, Abolfazl Mehbodniya; Investigation, Abolfazl Mehbodniya; Methodology, Dhirendra Prasad Yadav, Bhisham Sharma; Project administration, Ajit; Resources, Abolfazl Mehbodniya; Software, Bhisham Sharma; Visualization, Ajit, Bhisham Sharma; Writing – original draft, Dhirendra Prasad Yadav, Bhisham; Writing – review & editing, Ajit, Abolfazl Mehbodniya.

Correspondence to Bhisham Sharma.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


Yadav, D.P., Sharma, B., Noonia, A. et al. Explainable label guided lightweight network with axial transformer encoder for early detection of oral cancer. Sci Rep 15, 6391 (2025). https://doi.org/10.1038/s41598-025-87627-y


Received: 03 November 2024

Accepted: 21 January 2025

Published: 21 February 2025

DOI: https://doi.org/10.1038/s41598-025-87627-y
