Malaria Parasite Detection using Efficient Neural Ensembles

Malaria is caused by parasites of the genus Plasmodium, transmitted through the bite of infected Anopheles mosquitoes, and has remained a major healthcare burden for years, with approximately 400,000 deaths reported globally every year. The traditional diagnosis process for malaria involves examining a blood smear slide under the microscope. This process is not only time-consuming but also requires highly skilled pathologists. Timely diagnosis and the availability of robust diagnostic facilities and skilled laboratory technicians are vital to reducing the mortality rate. This study aims to build a system that helps in the timely and accurate diagnosis of malaria, which would help reduce the mortality rate and eventually help in attaining a malaria-free environment. Deep learning techniques such as transfer learning and snapshot ensembling are applied to automate the detection of the parasite in thin blood smear images. Snapshot ensembling is a technique for creating better-performing ensembles with a limited training budget: instead of training multiple models, snapshots are recorded during the training phase and later ensembled to create one strong model. All the models were evaluated against the following metrics: F1 score, Accuracy, Precision, Recall, Matthews Correlation Coefficient (MCC), Area Under the Receiver Operating Characteristic curve (AUC-ROC) and Area Under the Precision-Recall curve (AUC-PR). The snapshot ensemble created by combining the snapshots of the pre-trained EfficientNet-B0 model outperformed every other model, achieving an F1 score of 99.37%, precision of 99.52% and recall of 99.23%. The results show the potential of model ensembles, which combine the predictive power of multiple weak models to create a single efficient model that is better equipped to handle real-world data.
The Grad-CAM experiment displayed the gradient activation maps of the last convolutional layer to show visually where a model looks and what it sees in an image when assigning it to a particular class. The models in this study correctly activate the stained parasitic region of interest in the thin blood smear images. Such visuals make the model more transparent, explainable, and trustworthy, qualities that are essential for deploying AI-based models in the healthcare network.


I. INTRODUCTION
Malaria, a common disease that poses life-threatening risks, is most prevalent in the tropical and sub-tropical regions of the world. One of the oldest diseases known to mankind, it is caused by the bite of the Anopheles mosquito, which acts as the vector for parasite transmission. The blood-sucking event injects the parasite present in the mosquito's saliva into the person's blood stream. Studies show that the parasites trick the liver into supporting their growth and replication [1]. The parasites first multiply in the liver and later invade the blood stream. Once in the blood stream, they multiply and invade the red blood cells, eventually killing them. The transmission of parasites between the vectors and humans is a cyclic process in which a blood-sucking event on an already infected person transmits parasites back into the mosquito, where they multiply and grow again. The parasite is detected in the blood stream of the infected person by microscopic examination: the person's blood specimen is spread on a slide as either a thick or a thin smear and stained with a staining agent to bring out the blood cells for examination under the microscope by highly qualified pathologists. Thick blood smears aid in detecting the presence of malarial parasites, while thin smears aid in parasite species identification and quantification [7]. However, this is a very time-consuming and exhausting process that requires highly skilled pathologists. Also, in rural and remote regions that lack good medical facilities, timely diagnosis is a challenge, which can affect the patient's health and may even prove fatal. Regions where malaria is still endemic remain so either due to the absence of proper healthcare facilities or due to the shortage of skilled pathologists [8].
The United Nations and the Bill & Melinda Gates Foundation target the global eradication of malaria by 2040 [9], while India aims to attain malaria-free status by the year 2030 as part of the Government of India's initiative, the National Framework for Malaria Elimination in India [10]. To achieve these targets, robust systems for quick and prompt diagnosis need to be available and accessible globally. Such systems will help reduce mortality rates and consequently contribute towards eradicating malaria.
There have been numerous research activities towards the detection of malarial parasites using computer vision, machine learning and deep learning techniques. The computer vision techniques involve segmentation of erythrocytes and feature extraction, with the features then fed to a classification algorithm. However, extracting good features is a challenging task and requires expert-level domain knowledge. Recent developments in deep learning systems have shown that they are able to extract and learn the relevant features automatically. Convolutional Neural Networks, Transfer Learning and Model Ensembles are some applications of deep learning that stand out in image classification tasks, performing on par with human-level intelligence.
The objective of this study is to build a deep learning system that helps in the timely and accurate diagnosis of malaria using a low-cost ensemble-building technique known as Snapshot Ensemble, which builds strong learners at the cost of training only a single deep learning model. The rest of the paper is organized as follows: the literature review discusses the methods applied in previous studies related to malaria classification; the materials and methods section discusses the methods applied in this study; the results and discussion section analyzes the results obtained in the experiments of this study and compares them with previous studies; and finally the conclusion summarizes the study and mentions the scope for future work.

II. LITERATURE REVIEW
Image processing, and machine learning based algorithms have been used extensively for detection of malarial parasites. Machine learning algorithms require hand crafted features to be fed as input. Feature extraction can be a complicated task that requires an expert level domain knowledge and is also error prone. Various image processing methods are applied to remove unwanted information, enhance the quality, and extract relevant pixel-based features from the images. However, deep learning removes the burden of manual feature extraction, and has turned out to be the preferred choice for malaria classification in the recent times.
Image processing techniques like filtering, contrast stretching, segmentation, thresholding, and morphological operations have been applied to enhance the image quality and extract features. Filtering techniques have been widely used to remove any noise present in the whole slide images. Di Ruberto et al. [11] apply a (5x5) median filter for smoothing and noise removal. Díaz et al. [12] utilized a low-pass filter to get rid of the noisy components in the slide images. Anggraini et al. [13] applied a median filter for noise removal. May et al. [14] implemented median filtering to get rid of any salt-and-pepper noise in the image and also suggested the use of the Wiener filter to get rid of any Gaussian noise. Savkare and Narote [15] recommend Laplacian filtering for image smoothing and edge enhancement. A Gaussian low-pass filter has been implemented by Arco et al. [16]. Dong et al. [17] apply filtering techniques as part of the image preprocessing phase to prevent the neural network from learning irrelevant information. Loddo et al. [18] discuss the application of morphological filtering techniques for noise removal.
Anggraini et al. [13] apply contrast stretching to boost the properties of the parasite objects. May et al. [14] applied histogram stretching to enhance the contrast of the images. Histogram Equalization is applied by Savkare and Narote [15] to enhance contrast along with Partial Contrast Stretching, a linear mapping which works on the range of the average of the maximum and minimum red, green, blue intensity values. Adaptive Histogram Equalization is applied by Arco et al. [16] for localized contrast corrections to enhance the region of interest. A wide range of contrast correction techniques have been implemented to enhance the images but the partial contrast stretching, and adaptive histogram equalization prove to perform better on the pathological images.
Di Ruberto et al. apply simple thresholding technique to segment the region of interest and apply morphological closing to separate the white blood cells from the red blood cells [11]. Morphological difference is analyzed for pattern recognition and classification of parasites. Anggraini et al. [13] and May et al. [14] applied Otsu's algorithm to determine the objects of interest. A two-stage color segmentation technique to identify the region of interest and remove the white blood cells is applied by Prasad et al. [19]. Khan et al. [20] take the unsupervised learning approach and apply K-means clustering multiple times to segment the infected erythrocytes from other blood components. Savkare and Narote [15] and Bairagi and Charpe [21] apply Otsu's and Watershed algorithm with morphological opening to separate the overlapping red cells. Arco et al. [16] apply the adaptive threshold mechanism along with morphological closing to improve the segmentation of objects of interest. Dong et al. [17] applied thresholding with Hough Circle transformation to separate overlapping erythrocytes. GI et al. [22] applied a two-staged segmentation process involving adaptive thresholding followed by watershed algorithm to detach the overlapping red cells. Pan et al. [23] apply Otsu's algorithm followed by morphological closing to remove the irrelevant information. Nag et al. [24] apply watershed algorithm to separate the overlapping objects.
The degree of success of any machine learning system depends on the quality of the features fed to the algorithm. Extracting the best features requires expert-level domain knowledge and an understanding of the morphology of the various blood components and the parasite life cycle. Various sets of features have been extracted to solve the problem, with varying levels of success. Savkare and Narote [15] extract textural and morphological features based on the mean, standard deviation, kurtosis, area, perimeter, and major/minor axis of the erythrocytes. Bairagi and Charpe [21] apply the Gray-Level Co-occurrence Matrix (GLCM) to extract textural features like the contrast, a pixel's correlation with its neighbors and the energy (sum of the squares of the elements in the GLCM). Park et al. [25] extract morphological and geometric features such as elongation, equivalent diameter and eccentricity, along with the statistical features of skewness and kurtosis.
Bashir et al. [26] extract features related to skewness based on the pixel intensity values. Dong et al. [17] use the Kullback-Leibler distance to determine the best features. Bibin et al. [27] used histogram-based features, the color coherence vector and textural features like the Haralick features derived from the GLCM, Local Binary Pattern features and Gray Level Run Length Matrix features to train the system. Gezahegn et al. [28] utilize the Scale Invariant Feature Transform to create a bag-of-features. Nag et al. [24] extract features based on texture and morphology to filter out the white blood cells.
The extracted features are fed to algorithms that try to learn the relations and representations of the features that make it possible to detect the presence of parasites. Many algorithms have been implemented with varying levels of success. Ross et al. [29] implemented a two-stage backpropagation feed-forward network system that classifies the erythrocytes as infected or not infected in the 1st stage and classifies the parasite species in the 2nd stage, achieving a sensitivity of 85% and a positive predictive value of 81%. Tek et al. [30] propose a generic sliding window technique on top of the AdaBoost algorithm. Anggraini et al. [13] developed an algorithm based entirely on image processing and thresholding techniques to detect the parasite in the erythrocytes. May et al. [14] propose a segmentation-based system to segment out the infected cells and achieve a positive predictive value of 98.90%. Khan et al. [20] take the unsupervised learning approach and implement the K-means clustering algorithm to extract two clusters, the malarial parasite infected cells and the other components, achieving a precision of 0.95 and a recall of 0.93. Savkare and Narote [15] implement the Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel in a two-stage classification system. The binary classifier has a correct identification rate of 99.43%, while the SVM classifier achieves an accuracy of 96.42% in identifying the life cycle stage of the parasite. Park et al. [25] conduct several experiments based on the Linear Discriminant Classification, Logistic Regression, and K-Nearest Neighbors algorithms, achieving a classification accuracy greater than 95%. Bairagi and Charpe [21] apply the SVM algorithm and achieve an accuracy of 97.7%. Gezahegn et al. [28] proposed an SVM-RBF based classification system, which performs sub-optimally with an accuracy of 78.89%. Nag et al. [24] conduct multiple experiments based on K-Nearest Neighbors, Naïve Bayes, and Support Vector Machine with RBF kernel. The SVM model gives the best results with an accuracy of 97.59%. An ensemble of the above 3 models is created, which achieves an accuracy of 98.74%.
The deep learning framework eliminates the need to extract hand-crafted features. Liang et al. [31] propose a 17-layer custom CNN model and a transfer learning approach utilizing the pre-trained AlexNet [32] as a feature extractor linked to an SVM classifier for final classification, achieving classification accuracies of 97.37% and 91.99% respectively. Dong et al. [17] evaluate the performance of pre-trained deep neural models like LeNet, AlexNet and GoogLeNet. GoogLeNet outperformed the others owing to its greater depth and thus its ability to extract more detailed features. Bibin et al. [27] implement a deep belief network with 4 hidden layers, each trained as a Restricted Boltzmann Machine, to distinguish the stained objects. The system outperforms most state-of-the-art models with a small error rate of 0.0379. GI et al. [22] utilize a focus stack of images on a CNN model, achieving a sensitivity of 97.06%, a specificity of 98.50% and a Matthews Correlation Coefficient of 0.7305. Pan et al. [23] evaluate the effect of data augmentation on the pre-trained LeNet-5 model and achieve an accuracy above 90%. Rajaraman et al. [33] assess the performance of pre-trained state-of-the-art models on malaria images. Features were extracted from various layers of these models, and it was observed that the final layer of a model does not always provide optimal features. ResNet-50 outperformed the others, achieving an accuracy of ~95%. Hung et al. [34] propose a model based on the faster region-based network, where convolution is done only once per image. A Region Proposal Network proposes various object regions with bounding boxes, and the Object Detector based on AlexNet performs the final classification of the objects as schizonts or trophozoites, achieving an accuracy of 98%. A lightweight CNN model composed of 11 layers, proposed by Yang et al. [35], achieves an average accuracy of 93.46%. Rahman et al. [36] perform transfer learning experiments with the pre-trained VGG-16 model, achieving a test accuracy of 97.77%, and with a hybrid model combining CNN and SVM. The ensemble of the above models achieves an accuracy of 97.78%. Rajaraman et al. [37] propose an ensemble of deep networks with various transfer learning models acting as the weak learners. The ensemble of VGG-19 and SqueezeNet outperforms every other model with an accuracy of 99.51%. Vijayalakshmi et al. [38] take the amalgamated training approach by combining the feature extraction power of VGG-19 and the classification power of the SVM, achieving an accuracy of 93.13%.
An extensive literature survey helps to draw out the following observations on the various techniques applied to whole slide images: 1. Median filters are the most widely used for noise removal. 2. Otsu's and Watershed algorithms followed by morphological operations are widely used to segment the various components and mark the region of interest. 3. Various features based on pixels, texture, spectrum, color, statistics, etc. have been extracted to enhance pattern identification in the cellular components. 4. SVM with RBF kernel has been widely used for malaria classification. However, the entire process of feature extraction can be error-prone, and since all the tasks in the pipeline are sequential, any fault in the initial stages is forwarded to the model, resulting in an inaccurate diagnosis. Recent studies have started to explore deep learning, which removes the process of manual feature extraction. This saves time and minimizes the need for expert-level domain knowledge. The lack of labelled data is mitigated through data augmentation techniques.

III. MATERIALS AND METHODS

1) DESCRIPTION
The dataset used for this study is taken from the National Library of Medicine, part of the National Institutes of Health [39]. The dataset is an archive of red blood cell images segmented from Giemsa-stained slides. The samples were taken from 150 infected and 50 uninfected persons as part of the Malaria Screener research activity at CMC hospital in Bangladesh. It contains a total of 27,558 erythrocyte images with equal instances (13,779) from each category. FIGURE 1 shows a few data samples from the parasitized and non-parasitized categories.

2) DATA ANALYSIS
Bio-medical images are very diverse. For a similar pathological condition, images can vary considerably from person to person, and even for the same person across different encounters. These differences may be attributed to variation in lighting conditions, differences in marker stains (for pathological tests), the image extraction process, image dimensions, etc. Image preprocessing ensures that all the images are in the same standard format and free of any noise that adds no relevance to the analysis. In this study, all images are resized to a dimension of (135, 135), and every image goes through normalization and standardization, centering the pixel values around the mean to ensure faster convergence in the training phase. This contributes to a simple, accurate and robust classification system.
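The normalization and standardization step described above can be illustrated with a minimal sketch. It assumes 8-bit pixel intensities and per-image statistics; the function name `standardize` and the sample values are hypothetical, and the study's actual pipeline (including the resize to 135 x 135, which a library such as OpenCV would handle) may differ.

```python
from statistics import mean, pstdev

def standardize(pixels, max_value=255.0):
    """Scale raw intensities to [0, 1], then center on the mean and
    divide by the standard deviation (per-image statistics assumed)."""
    scaled = [p / max_value for p in pixels]
    mu = mean(scaled)
    sigma = pstdev(scaled) or 1.0  # guard against constant images
    return [(p - mu) / sigma for p in scaled]

# Hypothetical flattened pixel values from one image channel.
patch = [0, 64, 128, 192, 255]
z = standardize(patch)
```

After this step the values have zero mean and unit variance, which is the property that tends to help gradient-based training converge faster.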

3) DATA SPLIT
Since deep learning systems require a huge amount of training data to learn all the underlying image patterns and representations, most of the data is allotted to training: the dataset is split in the ratio 75:15:10. The training set gets 75% of the data, while the test set and the validation set get 15% and 10% respectively.
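A 75:15:10 split of this kind could be produced as in the sketch below. The function name `split_indices`, the fixed seed, and the choice to give rounding leftovers to the validation set are assumptions for illustration; the sketch also does not stratify by class, which a real pipeline may want to do.

```python
import random

def split_indices(n, fractions=(0.75, 0.15, 0.10), seed=0):
    """Shuffle range(n) and cut it into train/test/validation index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * fractions[0])
    n_test = int(n * fractions[1])
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    val = idx[n_train + n_test:]  # remainder goes to validation
    return train, test, val

# 27,558 is the dataset size stated in the text.
train, test, val = split_indices(27558)
```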

B. SYSTEM DESIGN AND ARCHITECTURE
Convolutional Neural Networks (CNNs) form the base of all the models and experiments conducted. An image in its raw format translates to an array of pixel values. Neighboring pixels are highly correlated and usually form the basis of feature extraction. CNNs exploit this correlation through techniques like local receptive fields, weight sharing and pooling, and through the use of multiple layers, which makes the entire architecture deep.
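To make local receptive fields, weight sharing and pooling concrete, the following toy sketch slides one shared 3x3 kernel over a small image and then applies 2x2 max pooling. It is an illustration only: it uses an unpadded convolution on nested lists, and the image and edge-detecting kernel are hypothetical values rather than anything learned in this study.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2x2(fmap):
    """Keep the maximum of each non-overlapping 2x2 window."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

# 6x6 image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
vertical_edge = [[-1, 0, 1]] * 3  # the same 9 weights are reused everywhere

feature_map = conv2d_valid(image, vertical_edge)
pooled = max_pool2x2(feature_map)
```

Each output value depends only on a 3x3 neighborhood (the local receptive field), and the same nine weights are applied at every position (weight sharing); pooling then halves the spatial resolution while keeping the strongest responses.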

1) PROGRAMMING RESOURCES
The programming tasks in this study are carried out on a cloud-based system with a CUDA-enabled Nvidia Tesla K80 GPU, 4 CPU cores and 20 GB RAM. The programs are written in Python 3.6 using the web-based Anaconda Jupyter environment. The deep learning models are created using the Keras library with a TensorFlow 2.2 backend and GPU acceleration enabled. All the image processing and computer vision tasks are carried out using the Open Source Computer Vision (OpenCV) library.

2) CUSTOM MODEL
A custom model with a total of 13 layers is designed from scratch, with 3 convolutional layers having 32, 64, and 128 filters respectively, 2 max-pooling layers and a Global Average Pooling layer. Each convolutional unit is ReLU-activated with Batch Normalization. Filters of size (3x3) with padding set to 'same' convolve the input. The max-pooling layers have a pool window of (2x2). The final classification layer is a dense layer with one unit and sigmoid activation. The model design is shown in FIGURE 2.
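As a rough check on the size of such a network, the parameter counts and feature-map sizes can be worked out by hand. The sketch below assumes a 3-channel (RGB) 135 x 135 input, bias terms in every convolution, four parameters per channel in Batch Normalization, and a stride-2 pooling without padding; these are assumptions, and the exact layer ordering in FIGURE 2 may lead to different totals.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights (k*k*in_ch per filter) plus one bias per filter."""
    return (k * k * in_ch + 1) * out_ch

def batchnorm_params(ch):
    """Gamma, beta, moving mean and moving variance per channel."""
    return 4 * ch

channels = [3, 32, 64, 128]  # assumed RGB input, then the three conv layers
conv = [conv_params(i, o) for i, o in zip(channels, channels[1:])]
bn = [batchnorm_params(c) for c in channels[1:]]
dense = channels[-1] + 1     # one sigmoid unit fed by global average pooling
total = sum(conv) + sum(bn) + dense

# 'same' padding keeps the spatial size; each 2x2 max-pool halves it.
size = 135
for _ in range(2):
    size //= 2
```

Under these assumptions the three convolutions hold 896, 18,496 and 73,856 parameters, the whole network has 94,273, and the feature maps shrink from 135 to 67 to 33 pixels per side before global average pooling.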

3) TRANSFER LEARNING
CNNs are basically feature extractors that learn the representation of an image. The state-of-the-art CNN models have learnt to extract features from millions of images. The skills of these models can be reused and applied to solve a related problem, an approach known as transfer learning. The main reasons to reuse the knowledge of pre-trained models are the lack of properly annotated data and of computational resources. The lack of properly annotated bio-medical images makes transfer learning an important aspect of applying deep learning in the field of digital pathology. The initial layers of a CNN extract generic features (corners and edges), the middle layers extract abstract features by aggregating the corners and edges, while the last few layers classify images based on the features extracted. The final layers of the pre-trained models might not be useful in classifying pathological images and can be removed. A small learning rate is chosen to ensure that the previous knowledge of the base model is retained and reused. State-of-the-art architectures like ResNet50-V2, DenseNet-121, Inception-v3, Xception, InceptionResNet-V2, and EfficientNet have attained strong performance by training over millions of images and are utilized for transfer learning. FIGURE 3 shows a transfer learning model using the EfficientNet-B0 network as the convolutional base.
The dataset consists of only 13,779 images of each category. Deep learning systems need a huge amount of data to learn all possible representations and perform well in a real-world environment. To increase the diversity and amount of training data, various augmentation techniques like flipping, rotation, cropping, translation, illumination, scaling, shift, and zoom are applied.
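Two of the augmentations listed above, flipping and rotation, can be sketched on a tiny image stored as nested lists. The helper names are hypothetical and the rotation is restricted to 90-degree steps to keep the example short; a training pipeline would typically apply randomized versions of these and the other listed transforms on the fly.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror the image top to bottom."""
    return img[::-1]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def augment(img):
    """Return the original plus three label-preserving variants."""
    return [img, hflip(img), vflip(img), rot90(img)]

cell = [[1, 2],
        [3, 4]]  # hypothetical 2x2 grayscale patch
variants = augment(cell)
```

Because an infected cell stays infected however it is oriented, each variant keeps the original label while showing the network a different arrangement of the same content.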

4) SNAPSHOT ENSEMBLE
The overhead cost of training multiple deep neural networks can be very high in terms of training time, hardware, and computational resource requirements, and often acts as an obstacle to creating deep ensembles. To overcome these barriers, Huang et al. [40] proposed a method of creating an ensemble which, at the cost of training one model, yields multiple constituent model snapshots that can be ensembled together to create a strong learner. The core idea behind the concept is to make the model converge to several local minima along the optimization path and save the model parameters at these local minima. During the training phase, a neural network traverses many such points. The lowest of all such local minima is known as the global minimum. The larger the model, the greater the number of parameters and the larger the number of local minima. This implies that there are discrete sets of weights and biases at which the model makes fewer errors, so every such minimum can be considered a weak but potentially useful learner for the problem being solved. Multiple such snapshots of weights and biases are recorded, which can later be ensembled to get a better-generalized model that makes the fewest mistakes. After each snapshot, training is restarted by increasing the learning rate. The idea behind increasing the learning rate is to take a big step and escape the local minimum. The learning rate follows a cyclic pattern as suggested by Loshchilov and Hutter [41], where the learning rate is quickly raised and then lowered along a cosine curve, as seen in FIGURE 5.

FIGURE 5. Cosine Annealing
Cosine annealing, proposed in [41], is used to decay the learning rate and is defined as
$$\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi \, \mathrm{mod}(t-1, \lfloor T/M \rfloor)}{\lfloor T/M \rfloor}\right) + 1\right)$$
where α(t) is the learning rate at epoch t, α0 is the maximum learning rate, T is the total number of epochs, M is the number of cycles, mod is the modulo operation, and ⌊·⌋ denotes the floor operation.
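The cyclic cosine schedule described above, in the form used for snapshot ensembles in [40], can be sketched in a few lines. The function name and the maximum learning rate of 0.01 are hypothetical; T = 50 epochs and M = 5 cycles follow the custom-model experiment reported later in this paper, and the floor of T/M is used for the cycle length as stated in the text.

```python
import math

def cosine_annealing_lr(t, alpha0, total_epochs, n_cycles):
    """Learning rate at 1-indexed epoch t for a cyclic cosine schedule."""
    cycle_len = total_epochs // n_cycles      # floor(T / M)
    position = (t - 1) % cycle_len            # epochs elapsed in this cycle
    return alpha0 / 2 * (math.cos(math.pi * position / cycle_len) + 1)

T, M, ALPHA0 = 50, 5, 0.01                    # ALPHA0 is an assumed value
schedule = [cosine_annealing_lr(t, ALPHA0, T, M) for t in range(1, T + 1)]

# A snapshot is saved at the end of each cycle, where the rate is lowest.
snapshot_epochs = [t for t in range(1, T + 1) if t % (T // M) == 0]
```

The rate starts each cycle at its maximum, decays towards zero over ten epochs, and jumps back up at epochs 11, 21, 31 and 41, which is the restart that lets the model leave the minimum it has just settled into.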
At the end of the full training run, there will be M different models available, ready to be ensembled with no additional overhead cost. At test time, "the last 'm' model snapshots are always considered for the final ensemble as these tend to have the least error rate". The final output is the average of the snapshot model outputs and is calculated as per the following equation defined in [40]:
$$h_{Ensemble}(x) = \frac{1}{m}\sum_{i=0}^{m-1} h_{M-i}(x)$$
where x is a test sample and h_i(x) is the softmax output score of the i-th snapshot.
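Averaging the outputs of the last m snapshots, as described above, amounts to a few lines of code. In this sketch `snapshot_scores` is a hypothetical list holding, for each snapshot in training order, its predicted infection probability for every test sample; the numbers are made up for illustration.

```python
def ensemble_average(snapshot_scores, m):
    """Average the per-sample scores of the last m snapshots."""
    last = snapshot_scores[-m:]
    n_samples = len(last[0])
    return [sum(scores[j] for scores in last) / m for j in range(n_samples)]

# Hypothetical scores: 3 snapshots x 2 test samples.
snapshot_scores = [
    [0.2, 0.9],   # snapshot 1
    [0.4, 0.7],   # snapshot 2
    [0.6, 0.5],   # snapshot 3
]
ensembled = ensemble_average(snapshot_scores, m=2)   # uses snapshots 2 and 3
```

Only inference is needed to build the ensemble, so different values of m can be compared on the validation data without any further training.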
For this experiment, (1) a custom model with 4 convolutional layers, as seen in FIGURE 6, is implemented to create snapshots, and (2) a snapshot ensemble model is created from the best-performing transfer learning model.

5) GRAD-CAM -VISUAL EXPLANATIONS
The application of AI systems in the healthcare domain is a challenging task, mainly because the factors involved in a machine's decision are not explainable. Questions such as "how did the machine arrive at the decision?" or "what did the machine see to classify the blood smear as infected by the malaria parasite?" will always be asked in order to understand the machine's reasoning in healthcare-based AI systems. To understand the dynamics of deep networks, Selvaraju et al. [42] proposed a technique to visually explain an AI system's decision and make it more transparent. The Grad-CAM technique explains a model's decision-making process, answering "why they predict what they predict". Grad-CAM relies on convolutional layers as they tend to retain spatial information. It utilizes the gradients flowing into the last convolutional layer, as this layer contains more class-specific features. The gradient score of the class of interest c is calculated with respect to the feature map activations A^k of the convolutional layer, given by the formula
$$\alpha_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^k}$$
where y^c is the score for class c, A^k_{ij} is the activation at location (i, j) of feature map k, and Z is the number of locations in the feature map. Once the gradient scores are calculated, ReLU is applied to the weighted linear combination of the forward activation maps, $L^c_{Grad-CAM} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$, to consider only those features that have a positive impact on the class of interest.
The Custom Model in FIGURE 2, will be utilized to visualize the internal dynamics of what a model sees to arrive at the decision.
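The two Grad-CAM steps, averaging the gradients of each feature map into a single weight and then applying ReLU to the weighted sum of the maps, can be traced on toy numbers. The sketch assumes the activations and the gradients of the class score have already been obtained from a framework's automatic differentiation; the function name and all values are hypothetical.

```python
def grad_cam(activations, gradients):
    """activations, gradients: K feature maps, each an H x W nested list.
    Returns the per-map weights and the ReLU-rectified heatmap."""
    h, w = len(activations[0]), len(activations[0][0])
    # Global-average-pool the gradients of each map into one weight.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum of the maps, keeping only positive evidence.
    cam = [[max(0.0, sum(wk * a[i][j] for wk, a in zip(weights, activations)))
            for j in range(w)]
           for i in range(h)]
    return weights, cam

# Two hypothetical 2x2 feature maps and the gradients of the class score.
activations = [[[1.0, 0.0], [0.0, 2.0]],
               [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
weights, heatmap = grad_cam(activations, gradients)
```

Locations driven by the second map receive negative evidence and are zeroed by the ReLU, so only the regions that support the class remain in the heatmap, which is then upsampled and overlaid on the input image.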

6) HYPERPARAMETERS & CALLBACKS
All the model experiments are implemented using the hyperparameters and callbacks listed in TABLE 1.

IV. RESULTS AND DISCUSSIONS
Several experiments were conducted with custom models, transfer learning and snapshot ensemble strategies. Custom model experiment was extended to analyze the network gradients and understand the dynamics of the model's approach in arriving at a decision. (Note: all the evaluation metrics are measured in percentage.)

A. CUSTOM MODEL
The custom model is set to train for 50 epochs with an early stopping patience of 7 epochs. The model trained for 34 epochs before being stopped early. The training and validation history is shown in FIGURE 7. The model converges well on both training and validation data. The performance metrics of the model on the test data are summarized in TABLE 2. The model achieves an accuracy of 96.95%, which is reasonably good for a baseline model. However, the number of false negatives is 81, which is quite high for a healthcare model intended to identify disease. False negatives occur when a person having malaria is declared healthy. This hampers timely diagnosis and treatment for the patient and may have fatal consequences. The target is therefore to reduce the number of false negatives.
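The relationship between false negatives and the reported metrics can be seen by computing them directly from confusion-matrix counts. The counts below are hypothetical and are not the values behind TABLE 2; the function name is likewise an assumption for this sketch.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, accuracy and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # falls as false negatives rise
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical counts: 100 parasitized and 100 uninfected test cells.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

Recall is the metric that falls directly as false negatives rise, which is why it deserves particular attention in a screening setting even when accuracy looks high.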

B. GRAD-CAM VISUAL EXPLANATIONS
The custom model from the previous experiment was extended to visualize the gradients that help a model come to a decision. Visuals like these will help in designing explainable AI systems that are able to provide meaningful explanations as to why a particular decision was made. Such developments and capabilities would make the adoption of AI in the healthcare domain more realistic and smoother.

C. TRANSFER LEARNING
[...] FIGURE 9. The evaluation metrics obtained on the test data are summarized in TABLE 3. EfficientNet-B0 outperformed every other state-of-the-art model and achieved the highest F1 score of 97.95%, followed by DenseNet121 with an F1 score of 97.81%. EfficientNet-B0 records the fewest false negatives.

D. SNAPSHOT ENSEMBLE
The ensemble experiment was performed on two different models to test the effectiveness of the ensembles and ensure that ensembles perform better than the single model. One, a custom model represented by FIGURE 6 and second, the EfficientNet-B0 model since it turned out to be the best performing model in the transfer learning experiment.

1) CUSTOM MODEL SNAPSHOTS
The custom model was configured to use the cosine annealing cyclic learning rate scheduler. The model was set to train for 50 epochs and the number of cycles was set to 5, since the ensembled model is most effective when the snapshots are extracted far apart during the training phase. The snapshots were recorded at the end of the 10th, 20th, 30th, 40th and 50th epochs respectively. Each snapshot was evaluated on the test data and the results are summarized in TABLE 4. The evaluation metrics gradually increase towards the final snapshots, while the number of false predictions decreases steadily. Ensembled models were created by combining the snapshots. As mentioned in [40], "at the test time, the last 'm' models are always considered for the final ensemble as these tend to have the least error rate." Snapshots 4 and 5, with test-time losses of 6.06% and 5.98% respectively, were therefore always considered when creating the ensembled models.
Three different combinations of the snapshots were used to create the ensembles. Ensemble 1 combined every snapshot [1 to 5], Ensemble 2 combined snapshots [3,4,5] and Ensemble 3 combined snapshots [4,5]. Each ensembled model was evaluated on the test data and the results are summarized in TABLE 5. Ensemble 3 outperforms the other ensembles and the individual model obtained at the end of 50 epochs. Ensemble [4,5] achieves an F1 score of 98.44%, which is an increase of around 2.9% over the fully trained single model. Snapshots 1 and 2 do not contribute enough towards the ensemble performance. This could be attributed to a relatively higher loss, as the model is still learning during the initial cycles of the training phase. Snapshots 3, 4 and 5 contribute most to the ensembled model.
[...] [3,4,5], Ensemble 4 combined snapshots [3,4], and Ensemble 5 combined snapshots [4,5]. Each ensembled model was evaluated on the test data and the results are summarized in TABLE 7.

E. INTERPRETING PREDICTION SCORE
The model prediction on a test image gives the probability score for the cell being infected with the parasite. There are two samples in FIGURE 10 with infection probabilities of 20.70% and 29.40% respectively. In both images a few [...] FIGURE 11 shows a few samples with probability score and infection severity.

F. FALSE PREDICTION ANALYSIS
All the experiments conducted report some false predictions, both false positives and false negatives. An analysis was conducted to verify these false predictions and examine possible reasons for them. The Snapshot Ensembled [4,5] model in [...] FIGURE 12 and FIGURE 13 show all the misclassified samples with their ground truth. Among the false positives, a few samples clearly indicate the presence of the parasite after staining, yet they are labelled as 0 (uninfected). A similar pattern is seen among the false negatives, where samples that do not show the presence of the parasite after staining are labelled as 1 (parasitized). Thus, there is some amount of error in the data annotation. Since the data is labelled manually, such errors can happen. Correcting the labels and re-training the models could help improve the classification performance.
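Collecting the misclassified samples for this kind of manual review is straightforward once the predicted probabilities are available. The sketch below uses the label convention from the text (0 for uninfected, 1 for parasitized) and assumes a decision threshold of 0.5; the function name and the sample values are hypothetical.

```python
def split_errors(y_true, y_prob, threshold=0.5):
    """Return the indices of false positives and false negatives."""
    false_pos, false_neg = [], []
    for i, (label, prob) in enumerate(zip(y_true, y_prob)):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and label == 0:
            false_pos.append(i)    # predicted parasitized, labelled uninfected
        elif pred == 0 and label == 1:
            false_neg.append(i)    # predicted uninfected, labelled parasitized
    return false_pos, false_neg

# Hypothetical labels and model scores for six test cells.
y_true = [0, 1, 1, 0, 1, 0]
y_prob = [0.10, 0.92, 0.30, 0.85, 0.77, 0.05]
false_pos, false_neg = split_errors(y_true, y_prob)
```

The returned indices can be used to pull up the corresponding images, so that an expert can decide whether each case is a genuine model error or a mislabelled sample.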

G. RESULT COMPARISON
The results obtained in this study are compared with those of previous work on malaria classification discussed in the literature review. The comparison is shown in TABLE 9.
The EfficientNet snapshot ensemble model outperforms most of the previous work in all aspects and achieves results comparable to the work of Rajaraman et al. [37], who also examined the efficiency of neural network ensembles. However, a major improvement of this work lies in the fact that snapshot ensemble creates better performing ensembles by training only a single model. This not only saves training time but also makes ensembles feasible with a limited training budget in terms of hardware resources and computational power. The malaria detection system developed in this study will help in the accurate and timely diagnosis of malaria. The system can be deployed in remote regions which do not have access to skilled and experienced pathologists. Timely diagnosis can help design a smooth recovery path and eventually bring down the mortality rate. The integration of an explainable AI module via the GradCAM technique offers an added advantage by encouraging the usage of AI systems among not only healthcare experts but also patients.
However, this study is currently limited to the detection of only the P. falciparum parasite. Malaria can be caused by five different species of the parasite. Once extended to support all five species in a multi-class classification approach, the system would become a complete end-to-end malaria detection system.

V. CONCLUSION
A series of experiments was conducted as part of this study to build end-to-end deep learning systems for diagnosing malaria from thin blood smear whole slide images. Several custom network architectures were designed, along with a transfer learning approach and the snapshot ensemble technique. In the healthcare domain, where the lack of properly annotated data is often considered a limitation, transfer learning using state-of-the-art models that have a good knowledge base serves as an efficient alternative for designing high performance deep learning models. In the transfer learning experiment, the EfficientNet-B0 model turned out to be the most efficient, with an F1 score of 97.95% and an MCC of 95.89%. The ensemble learning experiment implemented the relatively new approach of snapshot ensemble, where multiple snapshots of the model are recorded during the training phase. Every snapshot is different from the others, as they tend to converge at different local minima and record a unique set of weights and biases. Two different model architectures, a custom model and the EfficientNet-B0 model, were each set up to extract five distinct snapshots. Many different ensembles were created using various combinations of the snapshots from each architecture. Each ensembled model proved to be stronger than the single model. The EfficientNet-B0 ensemble with the final two snapshots gave the best classification scores, achieving an F1 score of 99.37% and an MCC score of 98.74%. This model also has the least number of false predictions, meaning it generalizes well on unseen data and is a strong candidate for developing better clinical solutions for malaria detection. Also, the fact that the ensembled model is obtained by training only one model not only removes the need for high-end computational resources but also saves the considerable time otherwise required to train multiple large models and combine them into ensembles.
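For reference, the F1 score and MCC reported above follow from the four confusion-matrix counts. The sketch below uses the standard definitions of precision, recall, F1 and MCC; the function name `classification_metrics` and the example counts are hypothetical and are not results from this study.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and MCC from confusion-matrix counts.

    tp, fp, fn, tn: true positives, false positives, false negatives, true negatives.
    Any ratio whose denominator is zero is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative on imbalanced data
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts for a balanced test set of 200 cells
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike the F1 score, MCC also rewards correct predictions on the uninfected class, which is why the two values can differ for the same model.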
The GradCAM experiment shows where exactly a model looks in the image to arrive at a decision. It is observed that the model activates the region around the parasite in the infected cells and utilizes it to differentiate the parasitized cells from the unparasitized cells. The GradCAM visualizations make the models more transparent, explainable, and trustworthy, qualities that are essential for deploying AI-based models in the healthcare network.
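The core GradCAM computation can be sketched in a few lines. The sketch assumes that the feature maps of the last convolution layer and the gradients of the class score with respect to them have already been extracted from the network, which in practice is done with the deep learning framework's automatic differentiation. The function name `grad_cam` and the toy values are hypothetical, and plain nested lists are used only to keep the example self-contained.

```python
def grad_cam(feature_maps, gradients):
    """Compute a normalized Grad-CAM heatmap.

    feature_maps: K channels, each an H x W nested list of activations
    gradients: same shape, gradient of the class score w.r.t. each activation
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weight = spatial average of the gradients for that channel
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    # Weighted sum over channels, followed by ReLU to keep positive evidence
    cam = [
        [max(0.0, sum(w * fm[i][j] for w, fm in zip(weights, feature_maps)))
         for j in range(width)]
        for i in range(height)
    ]
    # Scale to [0, 1] so the map can be overlaid on the input image
    peak = max(map(max, cam))
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two 2 x 2 channels with constant gradients per channel
feature_maps = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]

heatmap = grad_cam(feature_maps, gradients)
for row in heatmap:
    print(row)
```

In the toy example, the first channel supports the class and the second opposes it, so only the locations where the first channel is active remain highlighted after the ReLU.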
This study considers only P. falciparum infected cells for malaria classification. However, there are five different species of the malarial parasite, namely P. falciparum, P. malariae, P. ovale, P. vivax and P. knowlesi. Extending the experiments conducted in this study to detect the other species of the parasite can be taken up as future work. Another important factor in the adoption of AI systems into healthcare practice is testing. Validating the model with external datasets, i.e., samples from a different population or distribution, would give the most reliable measure of the model's generalizability. The model can then be fine-tuned on the new population accordingly.