Implementation of C5.0 Algorithm using Chi-Square Feature Selection for Early Detection of Hepatitis C Disease
Abstract
Hepatitis C is a significant global health challenge that affects 71 million people worldwide and can lead to severe complications such as cirrhosis and hepatocellular carcinoma. Despite the availability of rapid diagnostic tests (RDTs), accurate early detection methods remain critically needed. This research aims to improve hepatitis C virus classification accuracy by combining the C5.0 algorithm with Chi-Square feature selection, addressing the limitations of current diagnostic approaches and potentially reducing diagnostic errors. The machine learning model is developed on a publicly available hepatitis C dataset from Kaggle. The workflow covers preprocessing (label encoding, handling of missing values, and normalization), feature selection, model development, and evaluation. The findings show that Chi-Square feature selection improves the effectiveness of machine learning algorithms. Specifically, the combination of the C5.0 algorithm and Chi-Square feature selection achieved an accuracy of 96.75%, surpassing the results reported in previous research. The study concludes that machine learning is an effective tool for detecting hepatitis C and can improve diagnostic accuracy. As a recommendation for future work, AutoML could be adopted to automate periodic selection of the best-performing algorithm, which may further improve detection capabilities.
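The feature selection step named in the abstract can be illustrated with a small sketch. The abstract does not say how the Chi-Square scores were computed or how many features were kept, so the following assumes the textbook Pearson statistic on categorical (or binned) features and a simple top-k cutoff. The column names and toy values are hypothetical and are not taken from the Kaggle dataset.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic between one categorical feature and the class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # contingency table cell counts
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(labels)            # column totals
    score = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            expected = f_count * c_count / n  # count expected under independence
            diff = observed.get((f, c), 0) - expected
            score += diff * diff / expected
    return score


def select_top_k(columns, labels, k):
    """Rank features by chi-square score and keep the k highest (k is an assumed cutoff)."""
    scores = {name: chi_square_score(values, labels) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: 8 patients, binary class label, three binarized features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "alt_high": [0, 0, 0, 0, 1, 1, 1, 1],  # separates the two classes perfectly
    "ast_high": [0, 0, 0, 1, 1, 1, 1, 0],  # partly informative
    "sex":      [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the class
}

selected, scores = select_top_k(columns, labels, k=2)
print(selected)  # ['alt_high', 'ast_high']
print(scores)    # {'alt_high': 8.0, 'ast_high': 2.0, 'sex': 0.0}
```

A feature that is independent of the class scores zero and is dropped, while the features most associated with the class are passed on to the classifier. Continuous laboratory values would need to be binned before this statistic applies.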
Copyright (c) 2024 Mahmud Mahmud, Irwan Budiman, Fatma İndriani, Dwi Kartini, Mohammad Reza Faisal, Hasri Akbar Awal Rozaq, Oktay Yildiz
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.