Comparative Analysis of Hepatitis C virus Genotype 1a (Isolate 1) using Multiple Regression Algorithms and Fingerprinting Techniques
Abstract
Approximately 70 million people worldwide have been infected with Hepatitis C virus (HCV), presenting a critical global health challenge. As a member of the Flaviviridae family, HCV can cause severe liver diseases such as acute hepatitis, chronic hepatitis, and cirrhosis. The HCV genome encodes a single polyprotein of 3010 amino acids, which is processed co-translationally and post-translationally by cellular and viral proteases into 10 individual proteins, arranged as 5'-C-E1-E2-p7-NS2-NS3-NS4A-NS4B-NS5A-NS5B-3'. These include the structural proteins (the core protein and the E1 and E2 envelope glycoproteins), the p7 protein, and the nonstructural proteins NS2, NS3, NS4A, NS4B, NS5A, and NS5B. The structural proteins are released by host ER signal peptidases, whereas the nonstructural proteins are released by the HCV NS2-3 and NS3-4A proteases. Despite extensive research, there are significant gaps in predictive and analytical approaches to managing HCV, particularly in understanding the polyprotein structure and its implications for drug discovery. This study addresses these gaps by employing machine learning techniques to analyze the HCV polyprotein using various fingerprinting methods and regression algorithms. The data were sourced from the ChEMBL database, and fingerprinting techniques such as PubChem, MACCS, and E-State were used. Four regression algorithms were applied: Gradient Boosting Regression (GBR), Random Forest Regression (RFR), AdaBoost Regression (ABR), and Hist Gradient Boosting Regression (HSR). Model performance was evaluated using the R² and Adjusted R² metrics, comparing default models with models whose hyperparameters had been tuned. Feature importance analysis was conducted to identify the key features influencing model performance, aiding in model simplification.
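The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes the standard definitions, R² = 1 - SS_res / SS_tot and Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of samples and p the number of features; the abstract does not state the exact formulas used in the study, and the activity values and predictions below are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """Penalize R² for the number of predictors (e.g. fingerprint bits)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical activity values and model predictions (not from the study).
y_true = [5.0, 6.0, 7.0, 8.0, 9.0]
y_pred = [5.5, 6.0, 6.5, 8.0, 9.0]

r2 = r2_score(y_true, y_pred)
# n_features=2 is an arbitrary illustrative choice.
adj = adjusted_r2(r2, n_samples=len(y_true), n_features=2)
print(f"R2 = {r2:.3f}, Adjusted R2 = {adj:.3f}")
```

Because Adjusted R² shrinks as the feature count grows relative to the sample count, it is a natural companion to R² when fingerprints of very different lengths are compared.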
The results show that hyperparameter tuning did not significantly improve predictive power, although it offered insight into model optimization. In particular, the default models achieved higher R² and Adjusted R² values than the hyperparameter-tuned models across the different fingerprinting techniques. GBR and RFR consistently performed well, with GBR achieving the highest R² values when using PubChem fingerprints. Feature importance analysis nevertheless identified the features that most strongly influenced model performance, which helped simplify the models and highlighted the potential of machine learning for improving the understanding of the HCV polyprotein structure. This research identifies optimal regression models and fingerprinting techniques, providing a strong framework for future drug discovery efforts aimed at improving global health outcomes. It also underscores the importance of continuing to advance drug discovery using machine learning.
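As a rough illustration of how feature importance analysis can simplify a model, the sketch below fits a default Random Forest regressor to synthetic binary "fingerprints" and keeps only the most important bits. It assumes scikit-learn and NumPy are installed; the data, the choice of two retained bits, and all variable names are hypothetical and do not reproduce the study's pipeline or its ChEMBL data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a fingerprint matrix: 200 compounds x 20 binary bits.
rng = np.random.default_rng(0)
n_compounds, n_bits = 200, 20
X = rng.integers(0, 2, size=(n_compounds, n_bits))

# Assumed ground truth: only bits 0 and 1 drive the target, plus small noise.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=n_compounds)

# Default hyperparameters apart from a fixed seed for reproducibility.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one value per fingerprint bit (they sum to 1).
importances = model.feature_importances_
ranked_bits = np.argsort(importances)[::-1]
top_bits = ranked_bits[:2]

# Simplified model trained on the selected bits only.
X_reduced = X[:, top_bits]
simplified = RandomForestRegressor(n_estimators=100, random_state=0)
simplified.fit(X_reduced, y)
print("Top bits:", top_bits.tolist())
```

In practice the number of retained features would be chosen by checking how R² and Adjusted R² change on held-out data as features are dropped, rather than fixed in advance as it is here.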
Copyright (c) 2024 Daffa Nur Fiat, Syifabela Suratinoyo, Indri Claudia Kolang, Injilia Tirza Ticoalu, Nadira Tri Ardianti Purnomo, Reza Michelly Cantika Mawara, Daniel Sengkey, Angelina Stevany Regina Masengi, Alwin Melkie Sambul
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.