Implementation of C5.0 Algorithm using Chi-Square Feature Selection for Early Detection of Hepatitis C Disease

Mahmud MAHMUD; Irwan BUDIMAN; Fatma INDRIANI; Dwi KARTINI; Mohammad Reza FAISAL; Hasri Akbar Awal ROZAQ; Oktay YILDIZ; Wahyu Caesarendra

doi:10.35882/jeeemi.v6i2.384

Mahmud MAHMUD Faculty of Computer Science, Lambung Mangkurat University, South Kalimantan, Indonesia https://orcid.org/0009-0004-6281-7312
Irwan BUDIMAN Faculty of Computer Science, Lambung Mangkurat University, South Kalimantan, Indonesia https://orcid.org/0000-0002-0514-7429
Fatma INDRIANI Faculty of Computer Science, Lambung Mangkurat University, South Kalimantan, Indonesia https://orcid.org/0009-0006-7180-6708
Dwi KARTINI Faculty of Computer Science, Lambung Mangkurat University, South Kalimantan, Indonesia https://orcid.org/0000-0002-7382-5084
Mohammad Reza FAISAL Faculty of Computer Science, Lambung Mangkurat University, South Kalimantan, Indonesia https://orcid.org/0000-0001-5748-7639
Hasri Akbar Awal ROZAQ Graduate School of Informatics, Department of Computer Science, Gazi University, Ankara, Turkey https://orcid.org/0000-0001-8007-4963
Oktay YILDIZ Faculty of Engineering, Department of Computer Engineering, Gazi University, Ankara, Turkey https://orcid.org/0000-0001-9155-7426
Wahyu Caesarendra Faculty of Integrated Technologies, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei Darussalam https://orcid.org/0000-0002-9784-4204


Keywords: Hepatitis C Disease, C5.0 Algorithm, Feature Selection, Machine Learning

Abstract

Hepatitis C, a significant global health challenge, affects 71 million people worldwide and can lead to severe complications such as cirrhosis and hepatocellular carcinoma. Despite the prevalence of the disease and the availability of rapid diagnostic tests (RDTs), accurate early detection methods remain a critical need. This research aims to improve hepatitis C virus classification accuracy by combining the C5.0 algorithm with Chi-Square feature selection, addressing the limitations of current diagnostic approaches and potentially reducing diagnostic errors. A machine learning model for hepatitis C prediction is developed on a publicly available dataset from Kaggle. The workflow covers preprocessing (label encoding, handling of missing values, and normalization), feature selection, model development, and evaluation. The findings show that Chi-Square feature selection significantly improves the performance of the machine learning algorithms tested. Specifically, the combination of the C5.0 algorithm and Chi-Square feature selection yielded an accuracy of 96.75%, surpassing the results reported in previous research. This highlights the benefit of pairing feature selection with machine learning algorithms to improve diagnostic precision. The study demonstrates that machine learning is an effective tool for detecting hepatitis C, with the potential to enhance diagnostic accuracy significantly. As a recommendation for future work, AutoML could be adopted to automate the periodic selection of the optimal algorithm, which may further improve detection capabilities.
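The workflow summarized in the abstract (label encoding, missing-value handling, normalization, Chi-Square feature selection, a decision tree classifier, and evaluation) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: scikit-learn does not provide C5.0, so an entropy-based `DecisionTreeClassifier` is used as a stand-in; the data are synthetic rather than the Kaggle dataset; and median imputation, min-max scaling, the 75/25 split, and `k=3` selected features are illustrative choices that the abstract does not specify. The helper names `make_toy_data`, `build_pipeline`, and `run_demo` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier


def make_toy_data(n_samples=300, seed=0):
    """Synthetic stand-in for the real data (assumption: 3 informative, 5 noise features)."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(["Donor", "Hepatitis"], size=n_samples)
    is_sick = (labels == "Hepatitis").astype(float)
    # Informative features are shifted upward for the "Hepatitis" class.
    informative = rng.normal(loc=is_sick[:, None] * 2.0, scale=1.0, size=(n_samples, 3))
    noise = rng.normal(size=(n_samples, 5))
    X = np.hstack([informative, noise])
    # Blank out about 2% of the entries to mimic missing lab values.
    X[rng.random(X.shape) < 0.02] = np.nan
    return X, labels


def build_pipeline(k=3):
    """Impute, normalize, select the top-k features by chi-square, then fit a tree."""
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        # chi2 requires non-negative inputs, so scale to [0, 1] before selection.
        ("scale", MinMaxScaler()),
        ("select", SelectKBest(score_func=chi2, k=k)),
        # Stand-in for C5.0: an entropy-based CART tree, not C5.0 itself.
        ("tree", DecisionTreeClassifier(criterion="entropy", random_state=0)),
    ])


def run_demo():
    X, labels = make_toy_data()
    y = LabelEncoder().fit_transform(labels)  # string labels -> 0/1
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )
    model = build_pipeline(k=3)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    selected = model.named_steps["select"].get_support(indices=True)
    return accuracy, selected


if __name__ == "__main__":
    acc, cols = run_demo()
    print(f"accuracy: {acc:.4f}")
    print(f"selected feature indices: {cols.tolist()}")
```

Placing the imputer, scaler, and selector inside a single `Pipeline` means they are fitted on the training split only, which avoids leaking test information into feature selection. For an actual C5.0 model, the `C50` package in R is one commonly used implementation.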

