An Approach to ECG-based Gender Recognition Using Random Forest Algorithm

Human-Computer Interaction (HCI) has witnessed rapid advancements in signal processing research within the health domain, particularly in signal analyses like electrocardiogram (ECG), electromyogram (EMG)


I. INTRODUCTION
In the development of Human-Computer Interaction (HCI), research on signal processing in the health field, such as Electrocardiogram (ECG), Electromyogram (EMG), and Electroencephalogram (EEG) signals [1], [2], has developed rapidly. The ECG signal carries important information about an individual's medical history, identity, emotional state, age, and gender. Researchers have demonstrated the potential of ECG as a biometric recognition tool [3] and for gender classification [4], [5]. ECG is a diagnostic tool that best represents the electro-physiological patterns of the depolarization and repolarization of the heart muscle with each heartbeat. It has been used extensively in the prognosis and diagnosis of various diseases and disorders [6], [7]. An ECG records the heart's electrical activity as a voltage-versus-time graph through electrodes placed on the skin [8].
ECG reflects unique and easily measurable changes in heart potential, making it a specialized tool for human identification [9], [10]. ECG depicts a graph of heart activity that is believed to differ morphologically or structurally between individuals and to remain stable over an extended period [6], [9], [11], [12]. In diagnosing heart disease, ECG is the most crucial recording of heartbeats [13].
Gender identification is a fundamental task in various fields such as facial recognition [14], soft biometrics, and HCI. Gender is one of the key elements in security systems, video surveillance, online purchases, the judiciary, transportation, and demographic information collection [15], [16]. In the field of forensics, gender classification plays a role in assisting victims of criminal and civil cases, as well as in the resolution of missing persons cases [17].
Gender classification is the task of assigning male and female labels to biometric samples [18], [19]. Gender labelling can be achieved through facial recognition methods, gait analysis, dental X-rays, text data, and even ECG signals [3], [13]. To facilitate the gender classification process based on ECG [20], a machine learning method is necessary. Machine learning can be defined as the process of extracting hidden patterns from a large dataset. It is widely applied in various fields [21]. Using machine learning methods, tasks such as prediction, classification, filtering, and data grouping can be performed [22]. One of the machine learning methods suitable for classification is Random Forest.
In previous research, a combination of Discrete Wavelet Transform (DWT), dimensionality reduction with Independent Component Analysis (ICA), and Multilayer Perceptron (MLP) classification was employed for classifying arrhythmia diseases from the MIT-BIH ECG heartbeat signal data. The training and testing processes of the data utilized MLP-NN-based classification, yielding an accuracy of 96.50% for the classification method [6].
The study by [23] compared the decision tree algorithm and Support Vector Machine (SVM) with multi-domain features from the MIT-BIH arrhythmia ECG signal data. The feature set consisted of eight features based on Empirical Mode Decomposition (EMD), three from Variational Mode Decomposition (VMD), and four from the RR interval. The proposed method achieved the best results in decision tree classification with an accuracy of 98.89%, compared to SVM classification, which only reached an accuracy of 95.35%. The study by [24] combined DWT feature extraction with a 1-dimensional Convolutional Neural Network (CNN) algorithm for biometric authentication, using ECG-ID data with 90 subjects and only utilizing PQRST waves. The obtained results showed an accuracy of 92.2%.
The study by [25] employed One-Dimensional Multi-Layer Co-Occurrence Matrices (1D-MLGLCM) to recognize individuals based on their ECG signals. These matrices were used to extract Haralick features for classification with algorithms such as Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), Bayes Net (BN), and K-Nearest Neighbors (KNN). The proposed method achieved a success rate of 93.414% using SVM. The study by [6] compared the fine decision tree, medium tree, and ensemble RUSBoosted tree algorithms with feature extraction from the frequency and time domains for human identification. The accuracy results were 95.2% for the fine tree, 95.2% for the medium tree, and 95.5% for the ensemble RUSBoosted tree.
This research aims to determine the prediction performance of gender classification based on ECG signals using the Random Forest method. The evaluation provides insights into the potential and effectiveness of the Random Forest algorithm in handling electrocardiogram (ECG) signal data. Furthermore, this research holds a significant potential impact on the broader field of biomedical signal processing, demonstrating that ECG-based gender classification is reliable and has relevant applications in developing non-invasive gender prediction methods within the healthcare domain. These findings contribute to advancing the understanding and application of Random Forest algorithms in the specialized area of gender classification using ECG signals, thereby enhancing the knowledge base in biomedical research.

II. MATERIAL AND METHODS
In general, the research process produces classification results from the Random Forest method. Its stages include data collection, partitioning the data into training and testing sets, testing the model, and evaluation. The proposed model can be seen in FIGURE 1.

A. DATASET
The dataset used in this study is a numeric signal dataset consisting of heartbeats, known as the ECG-ID Database (https://physionet.org/content/ecgiddb/1.0.0/). We utilized the entire dataset, which consists of two parts: raw data and filtered data. The data were obtained from PhysioNet and comprise 310 recordings from 90 volunteers (44 males and 46 females aged between 13 and 75). The dataset distribution, the number of rows, and the features can be seen in TABLE 1 and TABLE 2. The ECG data are continuous and consist of 310 records with 10,000 features each, labelled X0 to X9999. The dataset includes two label columns, namely person_id and gender. The raw ECG signals are rather noisy and contain both high-frequency and low-frequency noise components [26]. The dataset is classified immediately after the filtering process, without additional feature selection or pre-processing steps; refer to FIGURE 1 for an overview of our research flow. This approach is implemented to directly evaluate the performance of the Random Forest model on ECG data. In the context of ECG signals, the 10,000 features refer to the representation of one ECG signal by 10,000 values, stored in the variables X0 to X9999. An ECG signal is recorded as a voltage-versus-time graph and measured using electrodes placed on the skin. The features are therefore data points sampled from the ECG signal at specific time intervals, and each value reflects the cardiac electrical activity at a particular time [8], [27].
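The column layout described above can be illustrated with a small sketch. It assumes a sampling rate of 500 Hz, which the text does not state, and the helper names `sample_time_seconds` and `to_row` are hypothetical; it only shows how one fixed-length recording could map onto the columns X0 to X9999 plus the two label columns.

```python
# Hypothetical sketch: mapping one ECG recording onto columns X0..X9999.
N_FEATURES = 10_000        # from the text: features X0 to X9999
SAMPLING_RATE_HZ = 500     # assumption, not stated in the text

# Column names for the signal values.
feature_names = [f"X{i}" for i in range(N_FEATURES)]


def sample_time_seconds(feature_index, rate_hz=SAMPLING_RATE_HZ):
    """Time offset in seconds of the sample stored in column X<feature_index>."""
    return feature_index / rate_hz


def to_row(signal, person_id, gender):
    """Flatten one recording into a dict keyed by X0..X9999 plus the two labels."""
    if len(signal) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} samples, got {len(signal)}")
    row = dict(zip(feature_names, signal))
    row["person_id"] = person_id
    row["gender"] = gender
    return row


# A flat placeholder signal stands in for real ECG voltages.
demo_row = to_row([0.0] * N_FEATURES, person_id=1, gender="male")
```

Under the assumed sampling rate, column X5000 would hold the voltage recorded 10 seconds into the recording.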

B. RANDOM FOREST
Random Forest is an Ensemble Learning algorithm built on the basic concept of Decision Trees. It consists of multiple Decision Trees that are built randomly and combined into one model using the Bagging technique. Bagging enhances the diversity of base learners by employing random sampling, thereby improving the algorithm's overall generalization performance [10]. To reduce the correlation between Decision Trees, Random Forest introduces random feature projection during the construction of each Decision Tree. This means that instead of using all variables in one tree, each Decision Tree considers only a subset of features at each potential split. Random feature projection can significantly reduce the correlation among trees because different trees grow on different feature sets, leading to smaller values [28].
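The two sources of randomness described above, bagging and random feature subsets, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the seed is arbitrary, and the subset size is assumed to be the integer square root of the number of features.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Bagging: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices for one split."""
    m = math.isqrt(n_features)
    return rng.sample(range(n_features), m)


rng = random.Random(42)                         # arbitrary seed for the sketch
rows = bootstrap_indices(310, rng)              # one bootstrap sample of 310 records
cols = random_feature_subset(10_000, rng)       # 100 candidate features for a split
```

With 10,000 features, each split would consider only 100 candidates under this assumption, which is what keeps trees grown on the same data from becoming near-copies of each other.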
Each tree is built using a random subset of the data, and the final prediction is determined by a vote from all trees [29], [30]. This approach improves accuracy, reduces overfitting, and works well for both classification and regression tasks. The formula for combining the decision trees is given in Eq. (1), where $h(x)$ is the prediction, $T$ is the number of trees, and $h_t(x)$ is the prediction of the $t$-th decision tree. Bootstrap sampling is given in Eq. (2): for each tree, $n$ samples are selected at random with replacement from the original dataset $D$. In Eq. (3), $M$ is the total number of features and $m$ is the number of features considered for splitting at each node, typically set to the square root of the total number of features:
$$m = \sqrt{M} \tag{3}$$
In Eq. (4), $p_i$ is the probability of $S$ belonging to class $i$, that is, the proportion of the dataset that belongs to class or category $i$, and $k$ is the number of classes or categories in the dataset. The Random Forest prediction in Eq. (5) combines the individual tree predictions and takes the mode as the final prediction:
$$\hat{y} = \operatorname{mode}\big(h_1(x), h_2(x), \ldots, h_T(x)\big) \tag{5}$$
The algorithm proceeds with the following steps [31], [32], [33]. The first step is to select random samples from the database. Subsequently, a decision tree is constructed for each sample, and predictions are obtained from each decision tree. Afterwards, the frequency of each class result is counted. The most frequently occurring result is then selected as the final prediction of the Random Forest. Thus, the algorithm combines decisions from various trees to enhance overall prediction accuracy and reliability.
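The final voting step described above can be sketched as follows. The per-tree predictions here are made-up labels used only for illustration, and the tie-breaking rule (the label seen first wins) is an arbitrary choice for this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most frequent class label among the per-tree predictions."""
    counts = Counter(tree_predictions)
    label, _count = counts.most_common(1)[0]
    return label


# Illustrative predictions from five hypothetical trees for one recording.
votes = ["female", "male", "female", "female", "male"]
prediction = majority_vote(votes)
```

Here three of the five trees vote for the same class, so that class becomes the ensemble prediction.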
According to [34], the Random Forest algorithm involves adjusting several parameters, two of which are key to the model's performance. The first is the number of trees (n_estimators), a positive integer that specifies how many decision trees are constructed within the Random Forest model. The second is the random number generator seed (random_state), which sets a seed value that controls the generation of random numbers in the trees. When a specific seed value is given, functions that rely on random numbers produce the same outcomes each time the model is trained or used for prediction. The number of trees plays a role in controlling the model's complexity: increasing it can enhance complexity but can also potentially raise the risk of overfitting. The random_state parameter, in contrast, governs the random numbers used in building the trees of the ensemble. Proper configuration of both parameters (TABLE 5) is therefore crucial for balancing model complexity against overfitting control. An example of the n_estimators parameter in the Random Forest model can be observed through the number of trees depicted in FIGURE 4 [35], [36].
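The role of the seed can be illustrated without any particular library. The sketch below uses Python's standard random module as a stand-in for the random number generator inside a Random Forest implementation; `sample_rows` is a hypothetical helper, and the parameter names n_estimators and random_state match those of scikit-learn's RandomForestClassifier, although the text does not state which library was used.

```python
import random


def sample_rows(n_samples, seed):
    """Bootstrap row indices drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


# Two independent runs with the same seed draw exactly the same rows.
first_run = sample_rows(310, seed=42)
second_run = sample_rows(310, seed=42)
same_draw = first_run == second_run
```

Fixing the seed in this way makes repeated experiments comparable, because any difference between runs then comes from the changed parameter and not from the random draw.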

C. CONFUSION MATRIX
The Confusion Matrix is an evaluation technique in the form of a 2x2 matrix used to determine the success rate of a model by counting the correct classifications of the dataset into active and non-active classes using a classification algorithm [37]. It is depicted as a square matrix in which rows represent the actual classes of instances and columns represent the predicted classes (TABLE 6). The Confusion Matrix generated by the model is used to calculate accuracy. Accuracy is chosen for evaluating the model's performance because this research involves a classification case with balanced data [20]. Accuracy is sufficient to determine the success rate of the model and is defined as the ratio of correctly classified instances to all instances, as in Eq. (6) [35], [38], [39]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$
Sensitivity is the ratio of true positive samples (TP) to the total number of samples that are actually positive (TP + FN), as in Eq. (7):
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{7}$$
Specificity is the ratio of true negative samples (TN) to the total number of samples that are actually negative (TN + FP) [25], as in Eq. (8):
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{8}$$
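As a worked illustration, the three metrics can be computed directly from the four cells of a 2x2 confusion matrix. The sketch uses the standard definitions (specificity taken as TN / (TN + FP)); the function name `binary_metrics` and the counts are hypothetical and are not results from this study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from 2x2 confusion matrix counts.

    Assumes each denominator is non-zero, i.e. both classes are present.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # correct predictions over all samples
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=40, tn=30, fp=20, fn=10)
```

For these counts the accuracy is 70/100, the sensitivity is 40/50, and the specificity is 30/50.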

III. RESULT
This section reports the accuracy produced by Random Forest, which is used to determine the success rate of the chosen method. Because this study only requires the gender label, the labels in the person_id column must first be separated from the dataset. The results of this separation can be seen in TABLE 7 and TABLE 8. The label-separated data are then divided into two parts: training data and testing data. Training data are used to train the model, while testing data are used to make predictions with the trained model. The data from TABLE 7 and TABLE 8 are divided using 10-fold cross-validation [40], and the resulting training and testing split can be seen in TABLE 9. The divided data are classified using the Random Forest model with the parameters n_estimators and random_state. We vary n_estimators from 50 to 500 decision trees and set random_state to 42 as the seed that controls the generation of random numbers in the trees. Each dataset yields accuracy, sensitivity, and specificity values, and the model results can be observed in TABLE 10 and TABLE 11.
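The 10-fold split can be sketched with the standard library alone. This is an illustration under stated assumptions and not the authors' exact procedure: the folds are unstratified, the records are shuffled once with seed 42, and `k_fold_indices` is a hypothetical helper.

```python
import random


def k_fold_indices(n_samples, k=10, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th index goes to the same fold, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 310 records split into 10 folds: 279 training and 31 testing records per fold.
splits = list(k_fold_indices(310))
```

Each record appears in exactly one test fold, so every recording is used once for testing and nine times for training.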

IV. DISCUSSION
The evaluation results of the Random Forest model in this study align with previous research [25], which used the Random Forest model with the person_id label in TABLE 3 and TABLE 4. This study, however, employs the gender label from TABLE 7 and TABLE 8 to identify gender. The evaluation results can be observed from the differences in accuracy, sensitivity, and specificity values between the two datasets. Raw data achieved the highest values at n_estimators = 300, with accuracy, sensitivity, and specificity of 55.000%, 46.452%, and 63.548%, respectively. For n_estimators from 50 to 500, the accuracy and specificity values of the raw data remain stable.
The visualization of raw data results from TABLE 10 can be seen in FIGURE 5.

FIGURE 5. Visualization of Raw Data Results
The filtered data achieved the highest values at n_estimators = 500, with accuracy, sensitivity, and specificity reaching 65.806%, 67.097%, and 67.097%, respectively. For n_estimators from 50 to 500, the filtered data show stable accuracy values, while sensitivity and specificity are better in the range of 300 to 500. The filtered data results from TABLE 11 can be visualised in FIGURE 6.

FIGURE 6. Visualization of Filtered Data Results
From the two visualizations in FIGURE 5 and FIGURE 6, raw data shows low sensitivity values for n_estimators from 50 to 500, while filtered data exhibits better sensitivity values in that range. One of the reasons for these differences can be observed in the visualization of the raw and filtered data signals in FIGURE 2 and FIGURE 3. Raw data tends to have more noise than filtered data.
The most significant effect of the noise level on the results of this research is the decrease in sensitivity values. Therefore, noise has a considerable influence [41] on the outcomes of the n_estimators parameter in the Random Forest model. Further information regarding the highest values for both datasets can be found in TABLE 12 and FIGURE 7. The weaknesses of this study lie in the absence of feature selection to support the model and in the selection of parameter values, because no optimization was performed to find the best parameters. Although this research does not delve deeply into these issues, it can serve as a starting point for further investigation.

FIGURE 7. Comparison Visualization of Filtered and Raw Data Results
A comparison of the classification results between our research and the previous study conducted by [20] is given in TABLE 13. Both our research and [20] utilize the ECG-ID Database for gender classification. In TABLE 13, our research results show lower values than the study in [20]. This is attributed to the fact that the Random Forest algorithm, which we employed, performs less effectively than the LSTM and Bi-LSTM models. Random Forest, as a representative of conventional machine learning, has not been able to compete with deep learning algorithms such as LSTM and Bi-LSTM, which are more effective in processing sequential data.
The strength of LSTM in processing sequential data makes it superior in this context. Conversely, the advantage of Random Forest lies in feature selection, but in this research that capability has not been fully utilized. For future research, several steps can be taken to improve the model's performance. One approach is to use a hybrid CNN-LSTM, where CNN can extract the best features. The implications of this study contribute to knowledge by presenting the performance results of the Random Forest algorithm in gender classification. The comparison between raw and filtered data indicates that filtered data outperforms raw data when using the Random Forest model. However, this research assigns parameter values arbitrarily, without prior testing to identify the optimal parameters. This approach could potentially lead to inaccuracies in the classification results.

V. CONCLUSION
The study's evaluation results of the Random Forest model show that raw data performed best at n_estimators = 300, achieving an accuracy of 55.000%, sensitivity of 46.452%, and specificity of 63.548%. Filtered data achieved better results, with the highest values at n_estimators = 500, reaching an accuracy of 65.806%, sensitivity of 67.097%, and specificity of 67.097%.
This research uses raw and filtered datasets, each exhibiting different performance characteristics. The sensitivity values of the raw data are notably lower across the range of n_estimators, indicating the impact of noise, especially on sensitivity. The visualizations of the raw and filtered data signals further highlight the noise disparity. The evaluation reveals that the filtered data outperforms the raw data, achieving higher accuracy, sensitivity, and specificity values. The most significant drawback identified is the low sensitivity on the raw data, primarily attributed to the higher noise levels. The implications of this research suggest the need for noise reduction, feature selection, and parameter adjustments in future studies to enhance model performance.
The study provides insights into the challenges and outcomes of applying the Random Forest algorithm to gender classification based on ECG data. The study acknowledges a limitation in parameter selection, as no optimization was performed to find the best values, and emphasises the importance of addressing noise and optimizing parameters for better accuracy. Although the study does not delve deeply into this issue, it is recognized as a potential area for further investigation. This research shows that the Random Forest model can determine an individual's gender information from ECG heart rate signal data. Considering the highly personal nature of medical information and the societal impact of this technology, it is crucial to be mindful of preventing its misuse on patients.
Therefore, the privacy of medical data must be carefully safeguarded. The findings contribute to knowledge by presenting the performance results of the Random Forest algorithm in ECG-based gender classification and contribute to the advancement of biomedical informatics regarding gender classification using ECG data. This is intended to help experts accurately identify an individual's gender on a larger and broader scale through ECG signals.

FIGURE 1. The Research Flow of Random Forest Classification Models

FIGURE 2. (a) Raw Signal Form of Male Gender (b) Filtered Signal Form of Male Gender

TABLE 3 and TABLE 4, row 2 with the male gender, and FIGURE 3 is from row 307 with the female gender. The contents of both datasets can be seen in TABLE 3 and TABLE 4.

TABLE 9 Training and Testing data split
TABLE 10 and TABLE 11.

TABLE 13 Comparison with existing work
LSTM, where CNN can extract the best features while LSTM can