Predicting the Need for Cardiovascular Surgery: A Comparative Study of Machine Learning Models

This research examines the efficacy of Machine Learning (ML) models, comparing previously published ensemble methods with a Deep Neural Network (DNN), in predicting the need for cardiovascular surgery, a critical aspect of clinical decision-making. It addresses key challenges such as class imbalance, which is pivotal in healthcare settings. The research involved a comprehensive comparison and evaluation of the performance of previously published ML methods against a new Deep Learning (DL) model. This comparison utilized a dataset encompassing 50,000 patient records collected at a large hospital between 2015 and 2022. The study proposes enhancing the efficacy of these models through feature selection and hyperparameter optimization, employing techniques like grid search. A novel aspect of this research was the comparison of a newly developed DNN model with existing ensemble models based on similar cardiovascular datasets. The results indicated the DNN model's superior predictive accuracy, demonstrating an Area Under the Curve (AUC) of 74%, alongside notable precision (68%) and recall (72%) for the minority class, which represents patients requiring surgery. The model further achieved a 70% F1-score and a balanced accuracy of 72%, outperforming the existing ensemble models in every key performance metric. The study underscores the potential of DNNs in predictive modeling for cardiovascular care and highlights the importance of integrating advanced ML techniques into clinical workflows. Future research should examine the practical application and integration of these models.


I. INTRODUCTION
In today's world, where electronic health data is widely collected and stored, there is an increasing conviction that Artificial Intelligence (AI), particularly its branch of Machine Learning (ML), could revolutionize healthcare practices. The use of ML to support understanding of cardiovascular disease is a growing field, with potential to enhance diagnostic precision and tailor treatment strategies [1]. Incorporating ML to support cardiovascular care is a prime example of how these technologies can help inform healthcare practices, especially in critical surgical settings [2]. These systems can help inform diagnostic methods and personalize treatment plans. However, compared to traditional medical diagnostics, their effectiveness remains a subject of ongoing research. Studies have shown that ML can be particularly valuable in areas rich with complex and large datasets.

The use of ensemble ML models and deep learning techniques in healthcare data analysis represents a significant shift in predictive modeling, particularly in critical clinical decision-making contexts. Ensemble models, which combine the predictive power of multiple base models, and deep neural networks (DNNs), a specialized category of ML algorithms, are pivotal in this transformation [3]. These complementary approaches allow healthcare organizations to make more accurate and informed decisions based on the intricate nuances of the data [4]. Predicting the need for surgery is a critical area, and ML can profoundly impact clinical settings due to the cost and risks associated with cardiovascular surgery [5]. ML employs algorithms that adapt to data, creating models capable of efficiently processing and understanding complex data.

ML is increasingly important in cardiovascular surgery, offering the potential to reduce costs. It can optimize surgical and postoperative processes by analyzing data to support treatment strategies and patient care. This is crucial in cardiovascular care, where surgery costs are high. Effective use of ML can lead to more efficient surgical planning, fewer complications, and better patient outcomes, thereby reducing the financial burden of cardiovascular surgeries [6][7][8]. While ensemble models excel at capturing patterns and insights from diverse and extensive healthcare datasets, DNNs, with their capacity to extract intricate patterns from rich and large healthcare datasets, also play a central role. Traditional statistical methods struggle with the multi-dimensionality and volume of clinical data, whereas ensemble models and DNNs offer a more nuanced and accurate approach. ML techniques, including ensemble models and DNNs, excel at analyzing vast and intricate datasets, demonstrating benefits in forecasting clinical results [9,10].

The challenge of class imbalance in surgical datasets, where non-surgical cases often outnumber surgical cases, is a significant hurdle in predictive modeling. Methods such as the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN) are employed to mitigate this challenge by equalizing class representation to enhance the model's forecasting accuracy [11]. Feature selection is also crucial in determining the efficacy of DNN models. Techniques like SelectKBest and Random Forest reduce data dimensionality, enhancing model efficiency and interpretability by selecting the features with the greatest influence on the prediction of the target variable [12].

This research aims to compare a DNN and several previously published ensemble ML models for forecasting the need for surgical intervention, using a dataset of 50,000 patient records from an Iranian hospital. The data was collected between 2015 and 2022. This study also addresses critical issues in clinical datasets such as class imbalance and feature selection. The suggested method utilizes a grid search algorithm for optimizing hyperparameters, providing a methodical way of adjusting the parameters that dictate an ML model's learning procedure and structure. Unlike model parameters, hyperparameters are set before the training phase and have a notable impact on the model's effectiveness. The rationale behind selecting these specific hyperparameters was to balance the model's accuracy and generalizability to unseen data and to ensure equitable prediction performance for both minority and majority classes in our imbalanced dataset [13,14].

In this study, the selection of DNNs over Convolutional Neural Networks (CNNs) is justified by the nature of our data and the specific objectives of our research. Unlike CNNs, which excel at extracting features from image-based data through their specialized architecture, DNNs are more adept at processing the structured, high-dimensional datasets commonly found in healthcare analytics. This capability is particularly crucial for our focus on cardiovascular disease prediction, where the data encompasses a wide array of variable types: numeric, categorical, and binary. DNNs leverage their deep, layered architecture to uncover complex patterns within such datasets, reducing the need for manual feature extraction and thus aligning with our methodology [19].
Research question: How can machine learning models, including deep neural networks, improve the prediction of surgical interventions for cardiovascular disease patients?
The distinctive contributions and significant methodologies employed in this study, when compared to prior approaches, encompass: 1) A new hybrid model incorporating feature selection, resampling, and a DNN. 2) A comparison between the proposed DNN and previously published ensemble models used on cardiovascular disease datasets with similar variables. 3) Feature engineering and data augmentation on high-dimensional data. 4) A grid search approach enabling precise adjustment of hyperparameters, class weights, and decision thresholds, enhancing the model's ability to accurately predict outcomes for both minority and majority classes. 5) Enhanced recall for the minority class (the important class based on the problem definition) while keeping precision high.
The following sections describe the dataset and the preprocessing methods, including feature engineering, feature selection for dimensionality reduction, feature scaling, and the handling of imbalanced data. TABLE 1 describes the studies selected for this comparative research and the methods that were used in them.

II. METHODOLOGY
In this study, we strategically selected a subset of ensemble models previously applied to cardiovascular-related datasets.
Our focus was on models that demonstrated potential in handling large, imbalanced datasets, as this aligns with our study's context and objectives. We did not include all models historically used in this domain but chose representative ones that offered varied approaches to cardiovascular disease outcome prediction. This selection was intended to provide a comprehensive overview of different methodologies and their performance in similar scenarios. The comparison is based on several performance indexes crucial for evaluating models on imbalanced datasets, which are common in disease prediction: precision, recall, F1-score, AUC (Area Under the Curve), and balanced accuracy. These metrics collectively provide a comprehensive assessment of model performance, supporting robustness and reliability in predicting cardiovascular disease outcomes. FIGURE 1 shows the research methodology diagram, including the steps taken to compare the models. The aim of this research is to enhance the predictive performance metrics for the minority class without compromising the metrics for the majority class. All models were implemented in Python and run in Google Colab.

A) DATASET INFORMATION
The

B) DATA PREPROCESSING
Data preprocessing is a critical task in data mining. The data must be of optimal quality, making it appropriate for use with various ML models. Enhanced data quality directly correlates with improved forecasting accuracy of ML models. In healthcare settings, high-quality data typically refers to accurate, complete, relevant, and consistently formatted information [20]. The preprocessing approaches shown in FIGURE 1 are discussed as follows:

1. FEATURE ENGINEERING
This study employed feature scaling on continuous variables to enhance the predictive performance of the ML models. This process is crucial given the diverse nature of the clinical data, which contains a mix of categorical, numeric, and binary variables. In datasets where numeric features have different ranges or units of measurement, feature scaling becomes a critical preprocessing step. It adjusts the scales of the numeric features to a uniform range, ensuring that no single feature disproportionately influences the model's performance due to its scale. This study used StandardScaler from Scikit-Learn for the numeric variables in our dataset. StandardScaler transforms the data such that each feature has a mean of zero and unit variance (and therefore also unit standard deviation). The shape of each feature's distribution is preserved, while the values are rescaled so that each feature contributes on a comparable scale to the analysis. This proves especially advantageous for algorithms that are sensitive to the scale of input variables, like neural networks. This normalization process was applied uniformly across both the training and test datasets to maintain consistency in data processing [21,22].

This study employed preprocessing techniques that are suitable for structured clinical and demographic data, rather than methods traditionally used in signal processing such as Root Mean Square (RMS), Mean Absolute Value (MAV), or Slope Sign Changes (SSC). This strategy, aimed at addressing the dataset's high dimensionality and diversity, did not involve time, frequency, or wavelet domain feature extraction, focusing instead on streamlining the dataset to emphasize the most informative features for cardiovascular surgery prediction [23].

In this study's feature engineering phase, we employed several techniques to enhance our clinical dataset's analytical utility. Though interconnected, each method serves a distinct analytical purpose. Aggregating related variables produced composite metrics, offering a holistic view of complex health conditions. For instance, two combined blood pressure metrics, Avg_DBP and Avg_SBP, were generated by averaging the different diastolic and systolic blood pressure measurements to reduce the dimensionality of the data. Lastly, developing indices to quantify health risks involved synthesizing multiple health variables into a single risk score.
Our study improved the dataset by creating indices that better represent health risks. The 'DentalHealthIndex' was developed by summing 'OralNumTeethDecayed' and 'OralNumTeethMissing', providing a simplified dental health measure. We also calculated 'PregnancyLoss' to reflect reproductive health, derived from the difference between 'NumberPregnancies' and 'NumberLiveBirths'. Additionally, we introduced body composition ratios, 'WaistToHipRatio' and 'WaistToHeightRatio', as indicators of obesity-related health risks. We removed the original variables that contributed to these aggregates. This feature engineering was undertaken to streamline the dataset and eliminate redundancy. The new aggregates offered a more comprehensive view of health risks and conditions, optimizing the dataset for effective machine learning analysis by focusing on the most informative and synthesized features. These techniques, while related, uniquely contribute to the depth and accuracy of our analysis [24,25].

Interactions of disease with age were considered, and one-hot encoding and categorical binning were used to transform categorical variables into a format that can be provided to ML algorithms. An average frequency score quantifying the occurrence and severity of specific symptoms was also calculated. This involved assigning custom weights to different levels of symptom frequency and severity. For instance, reflux frequencies like 'Almost every day' were assigned a higher weight (5) compared to 'Never' (0). Similarly, severity levels ranged from 'Slight' (1) to 'Lack of signage' (5). These methods played a crucial role in refining the raw data, enhancing its suitability for in-depth analysis. The description above covers only a portion of the feature engineering performed; many more modifications were undertaken to ensure a thorough and nuanced dataset evaluation.
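The aggregate features described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual code: the engineered feature names come from the text, but the raw input keys 'SBP1', 'SBP2', 'DBP1', 'DBP2', 'WaistCm', 'HipCm', and 'HeightCm', and the use of exactly two readings per blood pressure type, are hypothetical placeholders.

```python
def engineer_features(record):
    """Return a copy of one patient record with aggregate features added.

    Raw keys such as 'SBP1', 'DBP1', 'WaistCm', 'HipCm' and 'HeightCm'
    are hypothetical; the real column names are not given in the text.
    """
    out = dict(record)  # work on a copy so the input is left untouched

    # Average the repeated blood pressure readings into one value per type.
    out["Avg_SBP"] = (out.pop("SBP1") + out.pop("SBP2")) / 2
    out["Avg_DBP"] = (out.pop("DBP1") + out.pop("DBP2")) / 2

    # Dental index: decayed teeth plus missing teeth.
    out["DentalHealthIndex"] = (
        out.pop("OralNumTeethDecayed") + out.pop("OralNumTeethMissing")
    )

    # Pregnancies that did not result in a live birth.
    out["PregnancyLoss"] = out.pop("NumberPregnancies") - out.pop("NumberLiveBirths")

    # Body composition ratios; the contributing raw variables are removed.
    waist = out.pop("WaistCm")
    out["WaistToHipRatio"] = waist / out.pop("HipCm")
    out["WaistToHeightRatio"] = waist / out.pop("HeightCm")
    return out
```

Removing the contributing variables with `pop` mirrors the statement that the original variables were dropped once the aggregates had been created.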

2. FEATURE SELECTION
In machine learning, feature selection is a crucial procedure that boosts the efficacy of models by identifying and using the key features in the data [26]. The feature selection method used in each of the selected models is shown in TABLE 3. Model [15] is the only model for which feature selection is not used. The DL model in this study, like models [11] and [17], employs Random Forest feature importance, a robust method that evaluates and ranks features based on their contribution to model accuracy. This approach not only streamlines the model by reducing dimensionality and computational complexity but also improves predictive accuracy by focusing on significant, non-redundant features [12,27]. By eliminating irrelevant and redundant data, Random Forest feature importance helps mitigate overfitting and enhances the generalizability of the model to new data [28].

FIGURE 2 shows a bar chart that identifies the most influential factors in predicting the need for cardiovascular surgical intervention. This step revealed obesity and body composition indicators as the most important variables. This corresponds with established studies that associate obesity, a significant marker of cardiovascular health, with a heightened risk of cardiovascular disease (CVD) and overall mortality. Additionally, the variable 'PhysActExMildMinsDuringWork', indicative of mild physical activity levels during work hours, emerges as a significant predictor. This finding highlights the role of occupational physical activity, or lack thereof, in cardiovascular health and supports existing studies that associate sedentary behavior with heightened risks of CVD and mortality [29][30][31].

Age is another of the most important features, which is consistent with extensive medical literature that recognizes age as a fundamental risk factor for CVD. As age increases, the risk of CVD typically increases due to factors like arterial stiffness, hypertension, and endothelial dysfunction [32]. The bar chart prominently displays reflux symptom indicators as significant features, suggesting that symptoms of reflux, possibly related to gastroesophageal reflux disease (GERD), occurring frequently over the years, have a substantial impact on the model's predictions. This aligns with medical findings that GERD can be associated with respiratory conditions and might indirectly influence cardiovascular health [33]. "OralNumTeeth" and "DentalHealthIndex" suggest an interesting connection between dental health and cardiovascular health. Poor oral health, particularly periodontal disease, has also been associated with higher rates of CVD [34]. Family-related features like "FamilySize" and "Currently working" suggest that socioeconomic factors and family support structures may also play roles, potentially affecting stress levels and access to healthcare, which are known contributors to cardiovascular health [35]. Avg_DBP and Avg_SBP, representing average diastolic and systolic blood pressure, respectively, are key cardiovascular health indicators. Hypertension is a major risk factor for CVD, and its management is necessary in preventing the need for surgical intervention [36]. Drug abuse indicators like "OpiateEverRegularFlag" also appear as influential features for cardiovascular problems [37,38].

In conclusion, the feature importance bar chart provides valuable insight into the range of factors that the machine learning model has identified as significant in predicting the need for cardiovascular surgical intervention. The factors span from direct clinical measures to lifestyle and socioeconomic indicators, offering a multifaceted view of patient health. These findings are largely consistent with established medical knowledge and underscore the complexity of cardiovascular disease risk assessment [39].

3. APPROACH TO IMBALANCED DATA PROBLEM
An imbalanced data problem is a common challenge in predictive models based on clinical datasets because positive cases of a disease or diagnosis are typically the minority class [26]. In this case, the majority class samples, representing patients with a negative diagnosis, considerably outnumber the minority class samples. As a result, ML models trained on imbalanced datasets tend to be more proficient at forecasting outcomes for the majority class than for the minority class. The clinical data for this study is imbalanced, with 38 percent of the total patients having undergone cardiovascular surgery.

Multiple strategies have been developed to tackle the challenge of imbalanced datasets, typically applied during the preprocessing stage. Among these, resampling is widely adopted, particularly in scenarios where collecting or augmenting new data is impractical or impossible. Resampling balances the dataset by modifying the representation of classes [40]. SMOTE and ADASYN are two popular resampling methods. SMOTE enhances data balance by generating artificial samples in the minority class. The process involves selecting a random minority class instance and then identifying its nearest neighbors in the feature space. It then synthesizes new examples by adding a scaled difference between the chosen point and one of its neighbors, creating similar yet distinct instances [41].

ADASYN builds upon SMOTE by introducing a more focused approach to generating synthetic samples. It prioritizes those minority class samples that are more challenging for the model to learn, as determined by the proximity of majority class samples. ADASYN adapts the number of synthetic samples to the learning difficulty of each minority class sample, producing more synthetic data in regions where the classifier struggles the most. This strategy leads to a more nuanced balancing of the class distribution, particularly improving model performance in more complex or challenging areas of the data space. Both techniques aim to equalize the representation of classes in imbalanced datasets, enhancing the robustness of machine learning models [42]. The specific resampling methods used in each model are shown in TABLE 3.
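The interpolation step shared by SMOTE and ADASYN can be illustrated with a short standard-library sketch. It is a simplified illustration, not the implementation used in the study: the function names `k_nearest` and `synthesize` are hypothetical, distances are plain Euclidean, and ADASYN's adaptive allocation of samples per minority point is not shown.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def synthesize(minority, k=2, rng=None):
    """Create one synthetic minority sample by interpolation.

    A minority point x_i is picked at random, one of its k nearest
    minority neighbors x_zi is chosen, and the new sample is placed
    at x_i + lam * (x_zi - x_i) with lam drawn uniformly from [0, 1).
    """
    rng = rng or random.Random()
    x_i = rng.choice(minority)
    x_zi = rng.choice(k_nearest(x_i, minority, k))
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x_i, x_zi)]
```

Because each synthetic sample lies on the segment between two existing minority points, it stays within the region already occupied by the minority class. In practice, a maintained library implementation of SMOTE or ADASYN would normally be preferred over a hand-written version like this one.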

C) MACHINE LEARNING MODELS
Various machine learning models are capable of solving classification problems [43]. This study focuses on evaluating and comparing different ensemble machine learning models, specifically those previously applied to similar datasets, in the context of cardiovascular disease data. Moreover, it proposes a Deep Learning model that targets some of the weaknesses of prior approaches.

DEEP LEARNING MODEL
In this research, a deep learning model was created using TensorFlow and Keras, tailored to the specific requirements of our clinical dataset. The dataset, initially preprocessed for data quality enhancement, was split into training and testing sets in an 80-20 ratio. This division occurred before the standardization process, ensuring that the training and test sets were independently standardized. Continuous features in both sets were normalized using StandardScaler from Scikit-Learn, which adjusted each feature's distribution to have a mean of zero and unit variance. This normalization was essential for algorithms that are sensitive to the scale of input attributes, particularly neural networks.

Post-standardization, the training set underwent hybrid resampling using ADASYN (Adaptive Synthetic Sampling) to address class imbalance. This method balanced the dataset by generating synthetic samples in the minority class, enhancing the model's ability to learn from a more balanced data distribution. Class weights were computed to further manage the imbalance within the training data. These weights were used during the model training phase to ensure a balanced representation of both minority and majority classes, allowing for more effective learning.

The deep learning model, constructed as a Sequential model in Keras, included multiple dense layers with ELU (Exponential Linear Unit) activations and dropout layers. The model's architecture and hyperparameters, such as the number of layers, units per layer, dropout rates, and learning rate for the Adam optimizer, were optimized using Keras Tuner's Random Search. This optimization process was vital for tailoring the model to our dataset's specific characteristics.
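The standardization and class-weighting steps can be sketched with the standard library alone. Two assumptions are made here that the text does not confirm: the scaling statistics are taken from the training portion and reused for the test portion, and the class weights follow the common "balanced" heuristic n_samples / (n_classes × class count). The helper names are hypothetical.

```python
import statistics
from collections import Counter


def fit_scaler(train_column):
    """Estimate mean and standard deviation from training values only."""
    mu = statistics.fmean(train_column)
    sigma = statistics.pstdev(train_column)  # population std (divides by n)
    return mu, sigma


def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma using previously fitted statistics."""
    return [(x - mu) / sigma for x in column]


def balanced_class_weights(labels):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}
```

For example, with the 38 percent minority share reported for this dataset, `balanced_class_weights([0] * 62 + [1] * 38)` assigns the surgery class a weight of about 1.32 and the non-surgery class a weight of about 0.81, so errors on the minority class contribute more to the loss.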
The training involved binary cross-entropy as the loss function, with class weights incorporated to handle the imbalance. Early stopping based on validation loss was implemented to prevent overfitting, so that training stopped once the validation loss no longer improved. The model's performance was evaluated on the test set, focusing on metrics such as accuracy, F1 score, balanced accuracy, precision, and AUC score. Particular emphasis was placed on the recall score, given its importance in the context of our problem definition. This evaluation provided insights into the model's effectiveness in classification tasks. The flow diagram of the deep learning model, illustrated in FIGURE 3, outlines this structured process, from data preprocessing to model evaluation. This approach supported the robustness and reliability of the model, making it a valuable tool in our study.
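Feature standardization is expressed in Eq. (1).

$$X_{\text{scaled}} = \frac{X - \mu}{\sigma} \tag{1}$$

where $X$ is the input data, $\mu$ is the mean, and $\sigma$ is the standard deviation of the training data features. ADASYN resampling can be formulated as shown in Eq. (2).

$$x_{\text{new}} = x_i + \lambda \, (x_{zi} - x_i) \tag{2}$$

where $x_{zi}$ is a randomly selected neighbor from the $k$ nearest neighbors of $x_i$ in the minority class, and $\lambda$ is a random number between 0 and 1. This process is repeated until the class distribution is considered balanced, with the number of synthetic samples for each minority class sample being adjusted based on the density distribution of the minority class. The Sequential model with dense layers can be encapsulated by Eq. (3).

$$h_i = \mathrm{ELU}(W_i h_{i-1} + b_i) \tag{3}$$

where $h_{i-1}$ is the previous layer's output or the first layer's input, $W_i$ and $b_i$ are the weights and biases, and ELU is the activation function. Dropout is shown in Eq. (4).

$$\tilde{h} = M \odot h \tag{4}$$

where $\odot$ denotes element-wise multiplication, and $M_j$ is the $j$-th element of the mask $M$, indicating whether the $j$-th unit in the layer output $h$ is kept or dropped. Binary cross-entropy loss is shown in Eq. (5).

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \tag{5}$$

where $y_i$ is the true label, $\hat{y}_i$ is the predicted probability, and $N$ is the number of samples. The Adam optimizer's procedure can be formulated as shown in Eqs. (6), (7), (8), (9), and (10), respectively.

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t \tag{6}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2 \tag{7}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{8}$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{9}$$

$$\theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{10}$$

where $g_t$ is the gradient at time step $t$, $m_t$ and $v_t$ are estimates of the first and second moments, $\beta_1$ and $\beta_2$ are decay rates for these moment estimates, $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates, $\theta_t$ denotes the model parameters, $\eta$ is the learning rate, and $\epsilon$ is a small scalar added to improve numerical stability. Hyperparameter tuning (Random Search) is indicated in Eq. (11).

$$h^{*} = \arg\max_{h \in H} f(h;\, D_{\text{train}}, D_{\text{val}}) \tag{11}$$

where $h^{*}$ is the set of optimal hyperparameters found, $H$ represents the hyperparameter space, $f$ denotes the evaluation metric (e.g., AUC), and $D_{\text{train}}$, $D_{\text{val}}$ are the training and validation datasets.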
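As an illustration of the loss and optimizer formulas, the sketch below implements binary cross-entropy and a scalar Adam loop in plain Python. It is a toy example under stated assumptions, not the TensorFlow/Keras training code used in the study: the function names are hypothetical, the defaults beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the commonly used Adam settings, and the objective in the usage note is an arbitrary quadratic.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over N samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def adam_minimize(grad, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a scalar function with Adam, given its gradient `grad`."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

Calling `adam_minimize(lambda th: 2 * (th - 3), 0.0)` minimizes the hypothetical objective (θ − 3)² starting from θ = 0 and returns a value close to 3. The bias-correction terms matter mostly in the first few steps, when the moment estimates are still dominated by their zero initialization.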

III. RESULTS
There are several metrics for evaluating the predictive performance of ML models, depending on the context and objective of the model. This study uses a hold-out strategy, assigning 80 percent of the data to the training set and 20 percent to the test set. The imbalance in the dataset, characterized by a lower proportion of patients undergoing surgery compared to those who do not, calls for metrics that reflect how well the model predicts the smaller class. Precision measures the proportion of positive predictions that are truly positive, while recall (sensitivity) measures the proportion of actual positive cases that the model identifies. In medical predictions, high recall might be more crucial, as it is essential to identify as many actual cases (patients needing surgery) as possible [44].

The F1 score is the harmonic mean of precision and recall, that is, two times the product of precision and recall divided by their sum. Unlike the arithmetic mean, the harmonic mean is pulled toward the smaller of the two values, so a model cannot achieve a high F1 score unless both precision and recall are high. This makes it useful under class imbalance, as it accounts for both false positives and false negatives. The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is valuable for evaluating the classifier's performance across various threshold settings. It is effective for imbalanced datasets as it measures the ability of the model to discriminate between the two classes [45]. Balanced accuracy is the average of the recall obtained on each class. It is beneficial for imbalanced datasets because it weights each class equally, providing a fairer comparison when the class distribution is skewed [46]. The evaluation metrics are given in Eqs. (12), (13), (14), and (15), where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{12}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{13}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{14}$$

$$\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right) \tag{15}$$
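The following sketch shows how precision, recall, F1, and balanced accuracy follow from the four confusion-matrix counts. The function name and the counts in the usage note are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute metrics for the positive (minority) class from raw counts."""
    precision = tp / (tp + fp)        # share of positive predictions that are correct
    recall = tp / (tp + fn)           # share of actual positives that are found
    specificity = tn / (tn + fp)      # recall of the negative class
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }
```

With illustrative counts of 60 true positives, 20 false positives, 40 false negatives, and 80 true negatives, the function gives a precision of 0.75, a recall of 0.60, an F1 score of about 0.67, and a balanced accuracy of 0.70. The F1 score sits below the arithmetic mean of precision and recall (0.675), which reflects the harmonic mean's pull toward the smaller value.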
Ensemble machine learning models can be configured in many ways, each with potential applicability to cardiac and cardiovascular disease prediction. This paper undertakes a comparative analysis of prevalent ensemble models in the context of cardiovascular disease prediction, introducing a deep learning model developed to address specific weaknesses identified in previous approaches. These weaknesses include a limited ability to handle the imbalanced nature of healthcare data, challenges in capturing the complex non-linear relationships inherent in clinical information, and difficulties in managing high-dimensional feature spaces. The proposed deep learning model is designed to target these challenges, offering a more robust tool for predictive analysis in this critical domain. The efficacy of this model is systematically evaluated against the established ensemble models to determine its relative performance.

Ghavidel et al. [11] presented a stacking ensemble model, which, in their study, surpassed traditional classification approaches in performance metrics. Similarly, Ghorbani et al. [16] reported an approach that combines a Genetic Algorithm with a stacking ensemble, incorporating a suite of base classifiers (Logistic Regression, Decision Tree, Random Forest, and XGBoost) to predict early mortality in intensive care units, particularly within highly imbalanced datasets. Additionally, AdaBoost emerged as the top-performing model in the study by Linda Lapp et al. [15]; it showed the best overall performance with an AUC of 0.73. However, in terms of sensitivity (recall), which matters given the importance of minority class prediction, the Gradient Boosting Model (GBM) and RF performed better than the others. The present study implemented all three models on the data and selected the highest performer, which was the Stacking (GLM) model from Linda Lapp's study. Next, we compared the ensemble models used in the study conducted by Talayeh Razzaghi et al. [18] and found that AdaBoost outperformed the others. The performance of these models, when applied to the dataset of the current study, is detailed in TABLE 4.
To identify the most effective model for predicting cardiovascular surgical interventions, we evaluated several published ensemble models against a deep learning (DL) model that we developed to address the inherent challenges of such predictions. These ensemble models, denoted as [11], [15], [16], [17], and [18], were analyzed across multiple performance metrics, as illustrated in Figures 4, 5, and 6 and summarized in Table 4.
Model [11], while demonstrating high precision (71%) for the majority class, achieved a recall of only 27% for the minority class, a critical deficiency in the context of surgical intervention prediction. The capacity to detect true positives is paramount; a model that misses potential surgical cases is therefore less suitable despite its strong specificity, reflected in a majority-class recall of 82%. Model [11] achieved an AUC of 73% and a balanced accuracy of 55%, yet its low recall for the minority class is a significant limitation in a clinical setting. Model [15] offered a more balanced performance for class 1, with precision and recall scores of 63% and 50%, respectively. However, in the context of surgical interventions, this recall still implies that half of the critical cases could be overlooked, which is far from optimal. For the majority class, it delivered robust precision and recall, culminating in an AUC of 70% and a balanced accuracy of 67%. Model [16] presented a near-equilibrium between precision and recall for the minority class. Nonetheless, its modest recall of 58% for predicting surgical interventions may lead to a concerning rate of missed cases, despite an AUC of 72% and a balanced accuracy of 67%. Model [17] produced a precision of 61% for the minority class, indicating reasonable accuracy, but its lower recall of 38% signifies a considerable risk of missed surgical cases. This model's performance for the majority class was substantially better; however, because the minority class is the critical one in our study's context, the model's overall utility is diminished. Model [18] showed promise with a recall of 67% for the minority class, suggesting an ability to identify a higher proportion of patients requiring surgery. However, its precision of 54% raises concerns about the false positive rate, which, while less critical than false negatives in our study's context, still reflects the need for refinement.
The proposed DL model outperformed the ensemble models, securing the highest AUC score of 74%. It achieved a balance of precision (68%) and recall (72%) for the minority class, signifying a robust capability to detect cases requiring surgical intervention accurately and comprehensively. This balance is especially crucial in medical diagnostics, where overlooking a single patient in need of surgery could have dire consequences. Moreover, the DL model exhibited solid precision (77%) and recall (71%) for the majority class, and its F1 scores, along with a balanced accuracy of 72%, reinforce its overall effectiveness. The comparative analysis of the models' effectiveness in classifying the minority class is visually represented in FIGURE 4, while FIGURE 5 provides an overview of the overall performance of the models employed.
The distinct advantage of the DL model lies in its neural network structure, which captures the complex, nonlinear interactions within the high-dimensional clinical data. The model's hyperparameter optimization with Keras Tuner's Random Search, its use of advanced activation functions such as ELU (Exponential Linear Unit), and dropout regularization contributed significantly to its robustness. Importantly, introducing class weights in the model's loss function counteracted the dataset's imbalanced nature, ensuring an equitable focus on the minority class. This multifaceted approach resulted in a DL model that excels in predictive accuracy and in delivering equitable performance across classes, making it a reliable tool for clinical decision-making in cardiovascular care. The performance disparities among the ensemble models, and between these models and the proposed DL model, underscore the nuanced considerations involved in choosing a predictive model for clinical application. Our analysis reveals that while some models excel in precision and others prioritize recall, the DL model's capability to balance these metrics positions it as a superior choice in the context of our study. Based on the calculated confusion matrices, it can be inferred that the DL model demonstrates superior performance in accurately predicting both positive (surgery) and negative (no surgery) cases compared to the other models [11], [15], [16], [17], and [18] (FIGURE 6).
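The class weighting described above can be illustrated with a weighted binary cross-entropy. The sketch below is a minimal illustration and not the study's implementation: the "balanced" weighting heuristic, the function names, and the toy labels and probabilities are all assumptions, since the text does not state how the class weights were chosen.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This 'balanced' heuristic is an assumption for illustration; the
    study does not state how its class weights were derived.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_bce(labels, probs, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(labels)


# Hypothetical toy data: 8 non-surgical (0) and 2 surgical (1) patients.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.6]

weights = balanced_class_weights(y_true)
uniform = {0: 1.0, 1: 1.0}

unweighted_loss = weighted_bce(y_true, y_prob, uniform)
weighted_loss = weighted_bce(y_true, y_prob, weights)

print("class weights:", weights)
print("unweighted loss:", round(unweighted_loss, 4))
print("weighted loss:", round(weighted_loss, 4))
```

With these toy values, errors on the two surgical cases are scaled up and errors on the eight non-surgical cases are scaled down, so the optimizer is penalized more for missing minority-class patients. In Keras, a comparable effect is typically obtained by passing a `class_weight` dictionary to `Model.fit`.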

IV. DISCUSSION
The present study conducts a detailed comparative analysis of various machine learning models on cardiovascular-related datasets, focusing on performance for the minority class, a critical aspect in this domain. Our analysis revealed varying degrees of effectiveness among the models in predicting outcomes for the minority class, highlighting the inherent trade-offs in predictive modeling.
The deep learning (DL) approach proposed in this study demonstrated superior performance, exhibiting high precision and recall for the minority class. This balance is particularly significant in medical diagnostics, where the ability to limit false positives (high precision) and to minimize false negatives (high recall) is paramount. Our DL model's balanced enhancement of these metrics underscores its reliability and effectiveness for clinical decision-making. In contrast to our DL model, the ensemble models [11], [15], [16], [17], and [18] exhibited notable weaknesses in their performance metrics, particularly for the minority class. Model [11], while precise for the majority class, significantly lacked recall for the minority class, posing a risk of overlooking crucial cases. Model [15] offered a better balance but still missed half of the surgical cases. Model [16], despite its balanced approach, struggled to capture all positive instances effectively. Model [17] showed moderate precision but a marked deficiency in recall, potentially leading to missed surgical interventions. Model [18], with its higher recall for the minority class, indicated a stronger ability to identify true positives, but its lower precision raised concerns about an increased rate of false positives. These models, while effective in certain aspects, underscore the complexities and trade-offs inherent in predictive modeling for medical diagnostics.
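The relationship between these metrics can be checked directly from their definitions. The following sketch recomputes the F1-score and balanced accuracy from the precision and recall values reported for the DL model; the helper names are illustrative, and the confusion-matrix counts in the last step are hypothetical and are not taken from FIGURE 6.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(recall_pos, recall_neg):
    """Mean of the per-class recalls (sensitivity and specificity)."""
    return (recall_pos + recall_neg) / 2


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


# Values reported for the proposed DL model:
# minority class precision 68%, recall 72%; majority class recall 71%.
f1 = f1_score(0.68, 0.72)
bal_acc = balanced_accuracy(0.72, 0.71)
print(f"F1 = {f1:.3f}")
print(f"balanced accuracy = {bal_acc:.3f}")

# Hypothetical counts per 100 surgical patients, for illustration only:
# 72 detected (TP), 28 missed (FN), and 34 false alarms (FP).
p, r = precision_recall(tp=72, fp=34, fn=28)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```

Under these definitions, the reported precision and recall yield an F1-score of about 0.70 and a balanced accuracy of about 0.715, consistent with the 70% and 72% stated in the results.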
The study focused on feature selection to improve predictions for cardiovascular surgical interventions. It identified obesity, body composition, and age as top predictive factors for cardiovascular disease (CVD), a finding supported by existing medical literature. Other significant predictors include gastroesophageal reflux disease (GERD), dental health, and socioeconomic factors such as family size and employment status. Additionally, blood pressure levels and drug abuse history emerged as important indicators. These findings underscore the complex interplay of medical, lifestyle, and socioeconomic factors in assessing the risk of CVD and the need for surgical intervention. This research also underscores the importance of precision and recall in predicting the minority class in cardiovascular datasets. The DL model identifies true positives well (a recall of 72% for class 1 means that 72% of patients who required surgery were correctly identified) and limits false positives, as evidenced by its precision. This balance, alongside the model's high balanced accuracy and AUC score, reinforces its suitability in scenarios that demand a nuanced understanding of both classes. Our DL model effectively navigates these challenges, achieving a superior balance of precision and recall, a critical factor in the high-stakes realm of cardiovascular surgical intervention prediction. This study highlights the significant advantages of Deep Neural Networks (DNNs) in improving the accuracy and efficiency
The study acknowledges limitations regarding the generalizability of findings. The effectiveness of the DL model may vary across different datasets or clinical scenarios, and its performance is inherently tied to data quality and representativeness. Furthermore, the complexity of the DL model's architecture may pose challenges in certain clinical settings, necessitating ongoing monitoring and recalibration. Another limitation of the present study's predictive modeling approach is its inability to account for the dynamic clinical status of patients. The models generate risk scores based on data from a single time point, overlooking ongoing physiological changes. This static approach may not accurately reflect the evolving nature of a patient's health condition, which is a critical aspect of real-time clinical decision-making. Future enhancements to our models should aim to incorporate dynamic data analysis, allowing for more accurate and timely predictions that align with the rapidly changing clinical scenarios often encountered in healthcare.

V. CONCLUSION
The aim of this study was to conduct a comparative analysis of machine learning models on cardiovascular-related datasets, focusing on predicting outcomes for surgical interventions. Our findings highlight the superior performance of the deep learning (DL) approach, which demonstrates a balanced enhancement of precision and recall for the minority class. Future research should focus on integrating these ML techniques into clinical workflows, incorporating dynamic data analysis for more accurate predictions, and addressing limitations related to data quality and representativeness.

Disclosure of interest:
The authors report there are no competing interests to declare.

Consent to Publish
Consent for publication was obtained from all participants for whom identifying information is included in this article.

Data Availability:
Due to the sensitive nature of the medical data used in this study and in strict compliance with patient confidentiality and privacy regulations, the dataset cannot be publicly shared.

FIGURE 2. Variables extracted by RF feature importance.

FIGURE 3. The Deep Neural Network model's flowchart.

Data standardization for continuous variables can be formulated as shown in Eq. (1).

FIGURE 7 presents the ROC-AUC curves of the three models that achieved the highest AUC metrics, clearly

FIGURE 6. Comparison of the models' confusion matrices.

FIGURE 7. Comparison of ROC-AUC curves of the models with the highest AUC scores.

TABLE 2. Main variables of the dataset.
Is the sum of two variables, OralNumTeethDecayed and OralNumTeethMissing, where a higher number indicates poorer dental health.