Comparison of CatBoost and Random Forest Methods for Lung Cancer Classification using Hyperparameter Tuning Bayesian Optimization-based

Yra Fatria Zamzam; Triando Hamonangan Saragih; Rudy Herteno; Muliadi; Dodon Turianto Nugrahadi; Phuoc-Hai Huynh

doi:10.35882/jeeemi.v6i2.382

Yra Fatria Zamzam, Computer Science Department, Lambung Mangkurat University, Banjarbaru, South Kalimantan, Indonesia, https://orcid.org/0009-0001-0830-635X
Triando Hamonangan Saragih, Computer Science Department, Lambung Mangkurat University, Banjarbaru, South Kalimantan, Indonesia, https://orcid.org/0000-0003-4346-3323
Rudy Herteno, Computer Science Department, Lambung Mangkurat University, Banjarbaru, South Kalimantan, Indonesia, https://orcid.org/0000-0003-0637-8090
Muliadi, Computer Science Department, Lambung Mangkurat University, Banjarbaru, South Kalimantan, Indonesia, https://orcid.org/0000-0003-2871-9482
Dodon Turianto Nugrahadi, Computer Science Department, Lambung Mangkurat University, Banjarbaru, South Kalimantan, Indonesia, https://orcid.org/0000-0001-7746-2658
Phuoc-Hai Huynh, Computer Science Department, An Giang University, Vietnam National University, Ho Chi Minh City, Vietnam, https://orcid.org/0000-0001-8348-9267


Keywords: Lung Cancer, CatBoost, Random Forest, Bayesian Optimization

Abstract

Lung cancer has a high mortality rate and is often difficult to detect until it reaches an advanced stage. Data indicate that lung cancer is typically diagnosed late, which poses significant challenges to effective treatment, whereas early detection offers a better chance of recovery. This research therefore aims to develop methods for identifying and classifying lung cancer, in the hope of contributing knowledge on effective ways to detect the disease at an early stage. One approach under study is machine learning classification, which is expected to serve as a key tool for early disease detection and for improving patient survival rates. The study comprises five stages: data collection, data preprocessing, data partitioning for training and testing using 10-fold cross-validation, model training, and analysis of the evaluation results. Four experiments were conducted by applying two classification methods, CatBoost and Random Forest, each tested with default hyperparameters and with hyperparameters tuned by Bayesian Optimization. The Random Forest model tuned with Bayesian Optimization outperformed the other models on the lung cancer data, with accuracy 0.97106, precision 0.97339, recall 0.97185, F-measure 0.97011, and AUC 0.99974. These findings indicate that Bayesian Optimization for hyperparameter tuning in classification models can improve clinical prediction of lung cancer from patient medical records. Integrating Bayesian Optimization into hyperparameter tuning thus helps refine the accuracy and effectiveness of classification models, contributing to the ongoing improvement of medical diagnostics and healthcare strategies.
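
The experimental protocol summarized in the abstract (10-fold cross-validation, a default versus a Bayesian-optimized Random Forest, and five evaluation metrics) can be illustrated with a minimal sketch. It is not the authors' implementation. The data are synthetic stand-ins, the search space `BOUNDS` and the helpers `to_params`, `objective`, `expected_improvement`, and `evaluate` are hypothetical, and the surrogate used here (a Gaussian process with an expected-improvement acquisition) is only one common form of Bayesian Optimization; the abstract does not state which variant was used. Only the Random Forest arm is shown, using scikit-learn, NumPy, and SciPy.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate

# Synthetic stand-in data (assumption): the lung cancer records are not reproduced here.
X, y = make_classification(n_samples=200, n_features=15, n_informative=6, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # 10-fold cross-validation

# Hypothetical search space: (low, high) bounds for two Random Forest hyperparameters.
BOUNDS = np.array([[10, 80],   # n_estimators
                   [2, 12]])   # max_depth

def to_params(point):
    """Map a point in the unit square to integer hyperparameter values."""
    scaled = BOUNDS[:, 0] + point * (BOUNDS[:, 1] - BOUNDS[:, 0])
    return {"n_estimators": int(round(scaled[0])), "max_depth": int(round(scaled[1]))}

def objective(point):
    """Mean 10-fold cross-validated accuracy for one hyperparameter setting."""
    model = RandomForestClassifier(random_state=0, **to_params(point))
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

def expected_improvement(candidates, gp, best, xi=0.01):
    """Expected improvement over the best score seen so far (maximization)."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
points = rng.random((4, 2))                        # random initial design
scores = np.array([objective(p) for p in points])

for _ in range(6):                                 # six Bayesian Optimization steps
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True,
                                  alpha=1e-3, random_state=0)
    gp.fit(points, scores)                         # surrogate of the CV accuracy surface
    candidates = rng.random((500, 2))
    ei = expected_improvement(candidates, gp, scores.max())
    next_point = candidates[np.argmax(ei)]         # most promising candidate
    points = np.vstack([points, next_point])
    scores = np.append(scores, objective(next_point))

best_params = to_params(points[np.argmax(scores)])

# The five metrics reported in the abstract, averaged over the 10 folds.
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

def evaluate(model):
    result = cross_validate(model, X, y, cv=cv, scoring=metrics)
    return {m: result[f"test_{m}"].mean() for m in metrics}

default_report = evaluate(RandomForestClassifier(random_state=0))
tuned_report = evaluate(RandomForestClassifier(random_state=0, **best_params))
print("best hyperparameters:", best_params)
print("default:", default_report)
print("tuned:  ", tuned_report)
```

Under the same assumptions, the CatBoost arm of the comparison could be obtained by swapping the classifier inside `objective` and `evaluate` and adjusting `BOUNDS` to that model's hyperparameters. On synthetic data the tuned model will not necessarily beat the default one, so the printed scores illustrate the workflow only and say nothing about the reported results.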

