Comparison of Machine Learning Algorithms for Urine Glucose Level Classification Using a Side-Polished Fiber Sensor

Article History: Received June 22, 2020; Revised July 13, 2020; Accepted July 18, 2020
Abstract
Urine glucose levels can be used to determine whether glucose levels in the human body are too high, which may be a sign of diabetes. A non-invasive urine glucose classification model was built using the color of urine after the Benedict reaction to measure the level of glucose. The aim of this study is to classify urine glucose levels measured with a side-polished fiber sensor, comparing machine learning algorithms to find the one with the best performance. The sensor is made from a polymer optical fiber by removing the coating and cladding. The measurement relies on changes in the cladding refractive index, which affect the amount of light transmitted. The machine learning system was implemented using the Naïve Bayes Classifier, k-Nearest Neighbor Classifier, Logistic Regression, Random Forest, Artificial Neural Networks and Support Vector Machine. The measurement data, a total of 120 urine samples, were taken from a previous study. The experiments with k-fold cross-validation show that the neural network achieves an accuracy of 96.7%, precision of 0.967, recall of 0.967, and F1-measure of 0.967. With leave-one-out cross-validation, the best results are obtained by random forest and artificial neural networks, with accuracy 0.975, precision 0.975, recall 0.975, and F1-measure 0.975. The ANN algorithm is superior, achieving an accuracy value of 98.6%. Therefore, artificial neural networks are the best method for classifying glucose levels in the human body for fasting and postprandial urine tests.


I. INTRODUCTION
Diabetes mellitus is a condition triggered by the pancreas' failure to produce adequate quantities of insulin, so that the body loses the capacity to keep blood glucose at a stable level. According to the American Diabetes Association [1], diabetes mellitus is a chronic disease in which increased blood glucose can damage blood vessels and nerve tissue. Patients with diabetes should therefore monitor their glucose levels with external devices. Most current methods for measuring glucose levels are invasive: they rely on drawing blood by pricking the body or a finger with a needle, which causes pain [2]. Besides the pain, these methods have a psychological effect on patients who fear needles and syringes. Non-invasive methods instead sample a bio-fluid such as saliva, tears, sweat, or urine, so patients are more comfortable with the procedure [3].
Urine is one of the specimens that can be used to detect disease in the human body. Urine mixed with Benedict's solution shows a color change from clear blue to brick red, and the resulting color serves as a parameter for assessing the condition of the body. Urine color depends on the level of glucose in the patient's blood, as in diabetes mellitus.
One method used to detect glucose is a polymer optical fiber sensor. An optical fiber is a cylindrical transmission medium for light waves. Polymer optical fiber has several advantages, such as wide bandwidth, high sensitivity, operation in visible light, and safe use with a laser light source [4]. A photodetector or photodiode is used to measure the intensity of the light [5].
Fiber optic sensors can be used to measure the concentration of a solution and, in the medical field, as sensors for sodium chloride solution [6] and for blood glucose measurement [7] [8].
Diabetes mellitus can be identified from a patient's glucose level through a variety of laboratory tests. The results take discrete values that can be categorized. In this study, a classification method is used to determine the glucose level in the patient's body from the color of the urine in the sample test. Classification predicts the category or class of an item from categories that have been defined beforehand. One well-known classification method is the Naïve Bayes classifier. A previous study on early detection of diabetes mellitus using Naïve Bayes classifier analysis reported a classification accuracy of 82.30% [9]. Karegowda combined multi-level K-means clustering with k-Nearest Neighbor on a diabetes dataset and obtained a classification accuracy of 96.68% [10]. Omurlu [11] used a urine sample dataset to compare the performance of logistic regression in predicting type 2 diabetes mellitus, obtaining a best classification accuracy of 84.85% and a sensitivity of 68.0%.
The artificial neural network is a popular machine learning algorithm for classification and regression tasks. Geetha studied non-invasive measurement of the glucose concentration in urine samples, automating classification by color with an artificial neural network, and obtained approximately 96.93% accuracy from the linear relationship [12]. Machine learning algorithms such as support vector machines (SVM) have been proposed to boost the performance of predictive classification models on selected clinical datasets [13]. These efforts apply machine learning to diabetes mellitus research.
Based on these issues, the objective of this study is to compare and evaluate predictive models built with machine learning algorithms for classifying urine glucose levels using a side-polished fiber sensor. Almost all previous research uses no more than two machine learning algorithms and relies on a spectrometer sensor device that is relatively complex. This study focuses on comparing and analyzing the accuracy of the Naïve Bayes Classifier, k-Nearest Neighbor Classifier, Logistic Regression, Random Forest, Artificial Neural Networks and Support Vector Machine in determining the urine glucose level class from the samples. This research is expected to contribute a comparative analysis of the accuracy of classification methods for further investigation of diabetes mellitus.

A. Materials and Tools
In general, an optical fiber consists of three parts: core, cladding, and coating. The core is the main part that guides the light waves and has its own refractive index. The cladding reduces scattering loss at the core surface and protects the fiber from surface contamination and absorption. One type of optical fiber sensor that has been developed is the evanescent sensor, whose working principle is based on the evanescent wave effect. Evanescent sensors are made by polishing away the original cladding of the optical fiber so that the refractive index at the polished surface changes [14]. To reach the evanescent region of the optical fiber, part of the cladding is removed from the fiber surface and the polymer fiber is polished. In the evanescent region, the light energy passing through the fiber varies with the difference in refractive index. This system uses a 1-millimeter polymer optical fiber whose core is exposed to the sample.
The tools used include a tri-color LED, an OPT 101 light detector, a voltage amplifier circuit, a power supply circuit, an ATMega8535 microcontroller minimum system board, a 2×16 liquid crystal display, serial port monitor software, and FD-620-10 optical fiber. The optical fiber used in this research is multimode FD-620-10, chosen because its relatively large power loss makes it suitable as an optical fiber sensor. The fiber optic cable was cut to 10 cm. In the middle of the cable, the coating was removed with a cutter over a length of 5 mm, and the cladding was then stripped using acetone solution. After that, the exposed section was polished to ensure that it is clean and no cladding layer is left. The OPT 101 monolithic photodiode measures light intensity and is powered from a regulated 5 V supply. The photodetector produces a voltage as its output signal, which is then raised by the voltage amplifier circuit. The amplified voltage signal is connected to an analog port of the microcontroller, where an analog-to-digital conversion program turns it into digital data. The 10-bit ADC of the microcontroller converts the light detector output voltage into digital values in the range 0 to 1023. The microcontroller produces one measured value for each sample. Serial port monitor software on a personal computer, connected over a serial link, stores the monitoring data. The design and the device of the side-polished fiber sensor system are shown in Figure 1 and Figure 2.
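The relation between the raw ADC value and the amplified detector voltage can be sketched as follows. This is a minimal illustration, not the firmware used in the study: the function name is hypothetical, and the 5 V ADC reference is an assumption (the text only states that the photodiode is powered from 5 V).

```python
def adc_to_voltage(count: int, v_ref: float = 5.0, bits: int = 10) -> float:
    """Convert a raw ADC count to volts.

    Assumes the usual successive-approximation relation
    count = V_in * 2**bits / V_ref, so a 10-bit converter
    returns counts from 0 to 1023. v_ref = 5.0 is an assumption.
    """
    levels = 1 << bits  # 1024 quantization levels for a 10-bit ADC
    if not 0 <= count < levels:
        raise ValueError(f"count must be in 0..{levels - 1}")
    return count * v_ref / levels


# Example: a mid-scale reading corresponds to half the reference voltage.
mid_scale_volts = adc_to_voltage(512)
```

Because the largest count is 1023, the converted voltage always stays one step below the reference voltage.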

B. Data Collection and Experimental Method
The data in this study are a set of fiber sensor response signals obtained in a previous study [15]. Sensor response values (ADC values) and blood glucose levels were collected at different times, giving 120 data samples in total. Urine glucose was measured with the commonly used scheme of sampling during fasting and postprandially (two hours after consuming food). The sensor response signals for the fasting and postprandial urine glucose tests are shown in Figure 3 and

C. Machine Learning for Classification
The machine learning software used in this study is Orange version 3.24.1, a free visual programming tool whose workflows are built on Python libraries. Both training and performance testing of the machine learning models were carried out in this software. Orange makes it easy to apply machine learning to the classification of urine glucose level data. Workflow components in Orange are called widgets; they provide simple data visualization and use the default parameters of the underlying Python libraries.
Sample collection, preprocessing, quantitative assessment of the algorithms, prediction, and classification are all shown through interactive data visualization. Classification performance is tested by comparing six machine learning algorithms: Naïve Bayes Classifier, k-Nearest Neighbor Classifier, Logistic Regression, Random Forest, Artificial Neural Networks, and Support Vector Machines.

1) Naïve Bayes Classifier
The Naïve Bayes method is a supervised learning method that takes a statistical approach to inductive inference in classification problems. The algorithm finds the class with the highest probability in order to assign the test data to the most appropriate category. Classification techniques based on probability and statistics are used to predict likelihoods from past data. The Naïve Bayes algorithm uses a probabilistic model based on Bayes' theorem, as in Equation (1).
Here, X is sample data whose class is not yet known, and H is the hypothesis that X belongs to a particular class. P(H) and P(X) are the probabilities of the hypothesis and of the observed sample, respectively. P(X|H) is the probability of observing the sample X if the hypothesis is assumed valid, and P(H|X) is the probability that the hypothesis holds given the sample X.
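Equation (1) is not reproduced in this copy. With the symbols X and H defined above, the standard statement of Bayes' theorem would read as follows; this is offered as the usual textbook form, not as a transcription of the original equation.

```latex
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
```

The classifier evaluates this posterior for each candidate class H and assigns the sample to the class with the largest value; since P(X) is the same for every class, it does not affect the ranking.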

2) k-Nearest Neighbor Classifier
The k-Nearest Neighbor (KNN) classifier identifies objects based on the training samples closest to them in feature space. KNN works by finding the shortest distances between the data point to be classified and its nearest neighbors in the training data, and the object is assigned to the class of the training data nearest to it. It is important to choose the value of K, the number of nearest neighbors, appropriately. The distance between two data points is computed with the standard Euclidean formula in Equation (2).
Here, d(a, b) is the Euclidean distance, X and Y are data points 1 and 2, i indexes the features, and n is the number of features. In this analysis, the KNN algorithm was run with the number of neighbors K set to 5.
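As a rough illustration of the procedure, the sketch below computes the Euclidean distance as the square root of the summed squared feature differences and takes a majority vote among the K = 5 nearest training points. The function names and the tiny data set are made up for illustration and are not the study's data or Orange's implementation.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train, query, k=5):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up one-feature readings, for illustration only.
example_train = [
    ((100,), "Negative"), ((105,), "Negative"), ((110,), "Negative"),
    ((300,), "Positive 1"), ((305,), "Positive 1"), ((310,), "Positive 1"),
]
example_label = knn_predict(example_train, (102,), k=5)
```

With K = 5 the query at 102 sees three "Negative" neighbors and two "Positive 1" neighbors, so the vote returns "Negative".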

3) Logistic Regression
Logistic regression is a common method for building a predictive model of the probability of an event, in a manner similar to linear regression. Logistic regression is used when the output variable of the model is defined as a binary category. In the equation, Pbj is the predicted probability of the outcome coded as 1, and (1-Pbj) is the predicted probability of the other outcome, coded as 0. Logistic regression can be formulated as shown in Equation (3).
Here, βn is the slope of the n-th independent attribute, Xnj is the n-th independent attribute in record j, n is the number of independent attributes, and j indexes the records in the dataset.
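Equation (3) is not reproduced in this copy. A common way to write the model with the symbols defined above is the log-odds form below; the intercept β0 is an assumption, since the text names only the slopes.

```latex
\ln\!\left(\frac{P_{bj}}{1 - P_{bj}}\right)
  = \beta_0 + \beta_1 X_{1j} + \beta_2 X_{2j} + \dots + \beta_n X_{nj}
```

Solving for the probability gives P_{bj} = 1 / (1 + e^{-(\beta_0 + \beta_1 X_{1j} + \dots + \beta_n X_{nj})}), which maps any value of the linear combination into the interval (0, 1).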

4) Random Forest
Random Forest (RF) is a method that uses a voting process to obtain the final classification. It applies bootstrap aggregating together with random feature selection. In a random forest, many decision trees combine to form a forest. Each decision tree is built using a random vector: at each node, F input attributes are selected at random as candidates for the split. The value of F can be determined by Equation (4). The attribute used at the node is chosen from these F candidates according to a given criterion, and the value of F is kept constant. The final prediction is the class that receives the majority of votes among the trees.
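The three ingredients named above (bootstrap sampling, a random choice of F candidate attributes, and majority voting) can be sketched in a few lines. The helper names are hypothetical, tree construction itself is omitted, and F is passed in by the caller because Equation (4) is not reproduced in this copy.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, f, rng):
    """Pick f distinct attribute indices as split candidates for one node."""
    return sorted(rng.sample(range(n_features), f))


def majority_vote(tree_predictions):
    """Combine the per-tree class predictions into the forest's answer."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(list(range(10)), rng)
candidates = random_feature_subset(n_features=3, f=2, rng=rng)
forest_answer = majority_vote(["Positive 1", "Negative", "Positive 1"])
```

Each tree would be trained on its own bootstrap sample, drawing a fresh candidate subset at every node, and the forest reports the class predicted by most trees.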

5) Artificial Neural Networks
An artificial neural network (ANN) is a network that mimics the function of nerve cells (neurons) in the human brain. This computational approach is based on the interaction between neurons. In its simplest form the network has two layers: the input layer and the output layer. In a multilayer neural network, a hidden layer sits between the input layer and the output layer, and its nodes take the values of the preceding nodes as their inputs. Neural networks can be formulated as in Equation (5).
Here, a is the input, w is the weight, and b is the bias. The activation function (AF) used in this study is the sigmoid, which can be formulated as in Equation (6):
$$f(x) = \frac{1}{1 + e^{-x}} \tag{6}$$
The parameters used in the neural network are input dimension 3 and number of hidden layers 5.
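A single neuron with a sigmoid activation can be sketched as below. The sketch assumes that Equation (5), which is not reproduced in this copy, is the usual weighted sum of the inputs plus a bias; the weights and inputs are made-up numbers, and only the input dimension of 3 comes from the text.

```python
import math


def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias):
    """Weighted sum of inputs a with weights w plus bias b, then sigmoid."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)


# Three inputs, matching the input dimension stated in the text;
# the numeric values are illustrative only.
activation = neuron([0.2, 0.5, 0.1], [0.4, -0.3, 0.8], bias=0.1)
```

The sigmoid squashes any weighted sum into the interval (0, 1), and an input of zero maps to exactly 0.5.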

6) Support Vector Machine
The Support Vector Machine (SVM) works on a linear principle and has been extended to non-linear problems. It uses the kernel trick to find separating hyperplanes by transforming the dataset into a feature space. SVM rests on two main concepts: large-margin separation and kernel functions. The choice of kernel function determines the feature space in which the training set is classified. The kernel function in SVM is defined by Equation (7).
The kernel function used in this study is the radial basis function (RBF), with parameters C and gamma, given by Equation (8).
Here, γ is the kernel parameter. The parameter settings for the algorithm are the ν-SVM type, regression cost (C = 1), and the RBF kernel.
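Equation (8) is not reproduced in this copy. A common form of the RBF kernel, consistent with the single kernel parameter γ named above, is K(x, y) = exp(-γ‖x - y‖²). The sketch below evaluates that form; the function name and example vectors are illustrative assumptions.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical points give 1.0; the value decays toward 0 as points move apart.
same_point = rbf_kernel((1.0, 2.0), (1.0, 2.0), gamma=0.5)
far_apart = rbf_kernel((0.0, 0.0), (10.0, 10.0), gamma=0.5)
```

A larger γ makes the kernel fall off faster with distance, so each training point influences a smaller neighborhood.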

D. Performance Analysis
Various machine learning techniques are used in this research to test how well urine glucose levels can be classified. Classification here assigns each urine test to one of five classes: Negative, Positive 1, Positive 2, Positive 3, and Positive 4. Prediction and classification outcomes are then analyzed in terms of accuracy, precision, recall, and F1-measure using the confusion matrix shown in Table I.

1) Accuracy
Accuracy is the metric used to measure how correct the classifications produced by the algorithm are. Accuracy is computed as in Equation (9).

2) Precision
Precision is the metric used to measure the exactness of the classified test data, that is, the proportion of items assigned to a class that truly belong to it. Precision can be formulated as Equation (10).

3) Recall
Recall is the metric that measures how many of the test data belonging to a class are actually assigned to that class. Recall can be formulated as Equation (11).
4) F1-measure
The F1-measure is a single value that combines precision and recall, and it can be formulated as Equation (12).
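Equations (9) to (12) are not reproduced in this copy. The sketch below uses the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for one class treated against all others; the function names and example labels are illustrative, and how the study's software averages the per-class values is not assumed here.

```python
def confusion_counts(actual, predicted, label):
    """Count TP, TN, FP, FN for one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    tn = len(pairs) - tp - fp - fn
    return tp, tn, fp, fn


def scores(actual, predicted, label):
    """Return (accuracy, precision, recall, f1) for one class."""
    tp, tn, fp, fn = confusion_counts(actual, predicted, label)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, precision, recall, f1


# Made-up labels, for illustration only.
actual = ["Negative", "Negative", "Positive 1", "Positive 1"]
predicted = ["Negative", "Positive 1", "Positive 1", "Positive 1"]
acc, prec, rec, f1 = scores(actual, predicted, "Positive 1")
```

In this toy case one "Negative" sample is misread as "Positive 1", so recall for "Positive 1" stays at 1 while precision drops to 2/3 and F1 to 0.8.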
III. RESULTS
The measurement data, a total of 120 urine samples, were taken from a previous study. Visual programming with front-end features is used for interactive data exploration and visualization. In Orange version 3.24.1, visual programs are built as workflows on top of Python libraries. Orange's application modules, called widgets, cover data representation, subset selection, preprocessing, model learning and validation, and predictive analytics.
All prediction and classification models for urine glucose level use the Naive Bayes classifier (NB), k-Nearest Neighbor Classifier (KNN), Logistic Regression (LR), Random Forest (RF), Artificial Neural Networks (ANN) and Support Vector Machine (SVM). Figure 4 shows the visual programming workflow used to evaluate the learning algorithms. The data are classified using k-fold cross-validation with k = 10 and leave-one-out validation, and performance is measured by accuracy, precision, recall, and F1-measure. Cross-validation is a computational method that partitions the data into subsets, training on some subsets and using the remaining subset to test the output of the prediction model. The k-fold cross-validation in this analysis uses a total of 10 folds. In the first iteration, the first fold is used as test data and the rest as training data; the remaining iterations proceed in the same way with each fold in turn. The results of the six machine learning algorithms with 10-fold cross-validation are shown in Table II. The experiments show that the artificial neural network obtains the best validation results, with precision 0.967, recall 0.967, and F1-measure 0.967. This indicates that the ANN model reaches 96.7% classification accuracy on the test data.
In leave-one-out cross-validation, all data except one sample are used for training and the held-out sample is used for testing, repeated for every sample. The error rate of the classification model is the average of the errors over all iterations. The results of the six machine learning algorithms with leave-one-out cross-validation are shown in Table III.
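The two validation schemes described above differ only in how the 120 samples are partitioned. The sketch below generates the index splits in plain Python; it uses contiguous folds with no shuffling or stratification, which is a simplifying assumption (the software used in the study may do either), and the function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread any remainder
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def leave_one_out_indices(n_samples):
    """Leave-one-out is k-fold with as many folds as samples."""
    return k_fold_indices(n_samples, n_samples)


ten_fold = list(k_fold_indices(120, 10))
leave_one_out = list(leave_one_out_indices(120))
```

With 120 samples, 10-fold validation trains on 108 samples and tests on 12 in each of its 10 iterations, while leave-one-out trains on 119 and tests on 1 in each of its 120 iterations.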
For leave-one-out cross-validation, the results show that random forest and artificial neural networks achieve the highest values, with accuracy 0.975, precision 0.975, recall 0.975, and F1-measure 0.975. The ANN algorithm is superior, achieving an accuracy value of 98.6%. The classification performance chart for the fasting urine glucose tests is shown in Figure 7, and the chart for the postprandial urine glucose tests is shown in Figure 8.

IV. CONCLUSION
This study uses machine learning algorithms to classify urine glucose level data. Before classification, the measurement is carried out with a side-polished fiber sensor, and the sensor's response signal is converted into digital data by a microcontroller. Machine learning is then used to compare six classification algorithms: Naive Bayes classifier, k-Nearest Neighbor Classifier, Logistic Regression, Random Forest, Artificial Neural Networks and Support Vector Machine.
Based on our experiments, the classification algorithm using artificial neural networks has the highest accuracy among the tested classification algorithms. Experiments with k-fold cross-validation show that the ANN has a precision of 0.967, a recall of 0.967, an F1-measure of 0.967, and an accuracy of 96.7%. With leave-one-out cross-validation, RF and ANN reach accuracy 0.975, precision 0.975, recall 0.975, and F1-measure 0.975. The ANN algorithm has an accuracy value of 98.6%.
However, the classifiers were run with default parameter values, and parameter optimization to improve accuracy has not yet been carried out. Future work can add other machine learning methods such as deep learning, and include more samples with a wider variation of urine glucose levels to calibrate the measurement data.