Regression Algorithms in Predicting the SARS-CoV-2 Replicase Polyprotein 1ab Inhibitor: A Comparative Study

Due to its extensive steps and trials, drug discovery is a long and expensive process. In the last decade, as also hard pressed by the COVID-19 pandemic, the screening process could be assisted with the advancement in computational technology including the application of Machine Learning. The classification task in Machine Learning has become one of the major approaches for drug discovery. Unfortunately, this practice uses discretized labels that might lead to the loss of quantitative properties that could be meaningful. Therefore, in this paper, we aim to compare various Machine Learning regression algorithms in predicting inhibitory bioactivity, specifically the IC 50 value, with the SARS-CoV-2 Replicase Polyprotein 1ab as the target. With 1,138 non-duplicated data downloaded from the ChEMBL database that was engineered into four dataset variances, 42 regression algorithms were utilized for the prediction. We found that there are computational challenges to the use of regression algorithms in predicting bioactivity, for only a handful and a specific dataset variance that returned valid performance parameters upon testing. The three that yielded the highest counts of valid performance parameters are the Histogram Gradient Boosting Regressor (HGBR), Light Gradient Boosting Machine Regressor (LGBR), and Random Forest Regression (RFR). Further statistical analyses show that there is no significant difference between these three algorithms, except for the time taken for training and testing the model, where the LGBR excels. Therefore, these three algorithms should be primarily considered for the study with the same nature.


I. INTRODUCTION
It is well known that drug discovery is a both lengthy and costly process [8].The steps that cover pre-clinical, including several trials are found to be similar in several developed countries and may take 11.5 years until a market introduction [9], [10].However, in late 2019 a new form of coronavirus started infecting humans and even led to a global catastrophe in early 2020 [11].The catastrophic circumstances brought a dire need for the accelerated discovery of novel therapeutic options such as new healthcare technology [12] and Computer Aided Drug Discovery (CADD) [13].
In silico drug discovery, where computational powers are utilized, is a solution that has been extensively studied in recent years, especially in the pre-clinical step [10], [13]- [16].The methods include molecular dynamic simulation (MDS) [14] and machine learning [14]- [16].It has been used for discovering new compounds or repurposing known drugs, and until 2021, there were more than 70 approved drugs that were utilized in silico methods in its pre-clinical stage [14].
COVID-19 caused by the SARS-CoV-2 virus is one of the diseases that received global attention and where the CADD is actively researched [17].The drug candidate for SARS-CoV-2 may target human cells or, at the other end, target the virus itself [18].In this case, the 3-chymotrypsin-like protease (3CLpro) or the Main Protease (Mpro), which is one of the key proteases of the SARS-CoV-2 that is important in replication is an enticing target due to the inexistence of its human homolog [19], [20].This is the common scenario as reported in various studies [20]- [25].Another scenario targets the transmembrane protease serine 2 (TMPRSS2) at the host, rather than targeting the virus protein due to the highly mutative nature of the virus [26].
Machine Learning is a progressing field study concerning how computers can construct knowledge from a set of data (experience) [27].It has been used to predict interactions between drugs and proteins, discover efficacy, and confirm the safety biomarkers [28].Sulistiawan et al. [29] used a Deep Semi-Supervised Model to predict the Drug-Target Interaction (DTI) of the antiviral drug candidate of SARS-CoV-2.Machine Learning has also been used to discover the potential of Indonesian herbal compounds, as well as Traditional Chinese Medicine (TCM) as antiviral agents for SARS-CoV-2 [5], [15], [23].In a broader sense, Machine Learning also was used to predict drug synergy in cancer cells [30], as well as empowering web servers that can be used to predict inhibitory activities [2]- [4].
In terms of tasks, classification is the common one, applied in drug discovery.Drugs with known interaction will be labeled as 1 and those with no interaction will be labeled as 0 [2]- [5], [29]- [32].The label itself might have come from the DTI database or based on the half maximal inhibitory concentration value (IC50) [33], where the concentrations that are less than 1000 nM, between 1000 and 10000 nM, and greater than 10000 nM are labeled as active, intermediate, and inactive, respectively.In many cases, the intermediate ones are omitted.
Regardless of its popularity in drug discovery, the use of discretized labels based on IC50 reduces the data resolution, as well as complicates comparisons and evaluations with other numeric parameters.Grouping of numeric data is discouraged in epidemiology studies [34].Therefore, in this study, we use an uncommon approach in drug discovery, where we apply prediction instead of classification.To gain better insights, we apply various regression algorithms with the IC50 value used as a label.The purpose is to evaluate the possibility of using regression in place of classification in Machine Learningbased drug discovery.The motivation is to preserve the quantitative properties in the label, which are commonly lost when the label is discretized into groups in classification studies.The availability of the predicted inhibitory bioactivity in its original quantitative form should provide a better chance of comparison with other drug discovery approaches in the future.The SARS-CoV-2 Replicase Polyprotein 1ab is used as our target protein since we are still moving on from the pandemic to epidemic status.The remainder of this article is organized as follows: in Section II we present the workflow as well as steps involved in this research, followed by the Results and Discussion in Section III.In Section IV the paper is concluded and some potential issues to be explored are presented.

II. METHODS
The method is similar to common data science methodology with some addition of the Chemical Space Analysis, which is common in drug discovery, and statistical comparison of the algorithms [2], [3], [24], [35].The whole process, as illustrated in FIGURE 1, comprised three main parts: data preparation, Quantitative Structure-Activity Relationship (QSAR) modeling and evaluation, and statistical analysis.

A. DATA PREPARATION
This part includes data collection and preprocessing.For this part, we used a free Google Colab service.The bioactivity dataset is queried from ChEMBL by using its web service and the provided Python Application Programming Interface (API) [1].The ChEMBL ID for the target is CHEMBL4523582, queried on June 25 th , 2023.At the moment, there were 1,220 inhibitory bioactivity data of the SARS-CoV-2 Replicase Polyprotein 1ab stored in the ChEMBL database.The queried data includes fields such as the compounds in Simplified Molecular-input-line-entry system (SMILES) notation and the IC50 value towards the target of each compound.Later, the duplicated compounds are filtered, resulting in 1,138 non-duplicated inhibitory bioactivity data.The next step is labeling each compound bioactivity as "active", "intermediate", or "inactive", based on the respective IC50.The values less than 1000 nM are labeled as active, while those greater than 10000 nM are labeled as inactive, and those in between are labeled as intermediate.The frequency of each bioactivity class is shown in FIGURE 2.
The SMILES representation of each compound is standardized using string processing.The standardized SMILES is then used for calculating the pharmacokinetic profile of each compound by using the Python rdkit library version 3.1 [36], [37].The calculated profile includes the number of hydrogen bond donors (NumHDonors), the number of hydrogen bond acceptors (NumHAcceptors), molecular weight (MW), and the Ghose-Crippen-Viswanadhan octanolwater partition coefficient (LogP).Standardized SMILES are also used to calculate Pubchem molecular fingerprints, by using the function from the Padelpy library [38].Then, the IC50 values are converted to pIC50, and along with the standardized SMILES, and PubChem fingerprints, form the bioactivity dataset.

B. QSAR MODELING AND EVALUATION
The bioactivity from the previous step was then used as the dataset for training and testing the regression algorithms.The dataset was split with an 80:20 proportion for the training and testing data, respectively using the Scikit-learn Library [39].Then, we used the lazypredict library for batch training and evaluation of the regression models [40].In total, 42 regression algorithms were tested.The dataset splitting, model training, and evaluation were repeated 100 times, with the iteration number also serving as the randomization seed.Four dataset variances were used: 1. Full dataset, where the whole 881 features of the PubChem fingerprints as well as all bioactivity classes were used.2. Without low variance features, the PubChem fingerprints that had insignificant variance were not included in the modeling and evaluation process.3. Without the intermediate bioactivity class, only the compounds with IC50 that are less than or equal to 1000 nM and those greater than or equal to 10000 nM that included in the modeling and evaluation process.

Without low variance features and intermediate
bioactivity class, which yielded the smallest dataset of all fours.The modeling and evaluation part produced the performance dataset of the regression algorithms, with four dataset variances, where each scenario was repeated 100 times.The recorded performance parameters are R 2 , Adjusted R 2 , Root Mean Squared Error (RMSE), and Time Taken.
For the Modeling and Evaluation of the algorithms, we used a virtual machine with Ubuntu 20.04.6 with kernel 5.4.0-155generic.The installed Python version is 3.8 along with libraries in used related to this research are lazypredict 0.2.12, numpy 1.24.3, and scikit-learn 1.3.0.The host has an Intel Xeon E5-2630 v4 CPU with eight cores allocated to the virtual machine.The allocated RAM is 32 GB.

C. STATISTICAL ANALYSIS
The statistical analysis part consists of two subparts.The first one is the chemical space analysis, where statistical methods were used to explore the characteristics of the compounds in each bioactivity class.This helps to understand the chemical nature of the compounds in each class.The other second subpart is statistical comparisons of the performance of each algorithm in each scenario (dataset variance).Descriptive and inference techniques are used to compare the algorithms.As the result could be erroneous due to a programming glitch in the library or the data characteristics that are not compatible with certain algorithms, the performance data must be filtered according to the theoretically possible R 2 and Adjusted R 2 values before applying the statistical techniques.Last, we draw a conclusion based on the statistical analysis results.

A. CHEMICAL SPACE ANALYSIS
The chemical space analysis is aimed at understanding the chemical characteristics between the three bioactivity classes.First, the distributions of the bioactivity classes are visualized as the function of MW and LogP as shown in FIGURE 3(a).Second, the bioactivities are compared as the distributions of Lipinski's rule-of-five descriptors, as shown in FIGURE 3 intermediate bioactivity that strayed from the others.This fact indicates that most of the studied compounds are below 800 Da and LogP below 7.5.The boxplots of Lipinski's descriptors show that the compounds mostly satisfied Lipinski's rule of five.
T confirms the nonparametric natures of the descriptors in each bioactivity class.The Shapiro-Wilk's test of distribution was applied.Under a 95% confidence interval, all the p-values are below α = 0.05, indicating the values are not normally distributed.Therefore, we continued with the Kruskal-Wallis test for each descriptor, using the bioactivity classes as groups, as shown in TABLE 2. As can be observed, there is at least a group of compounds that is significantly different.Hence, the analysis continued with a post-hoc test, they were using Dunn's method with Bonferroni adjustment.The results are shown in TABLE .It can be seen that in terms of molecular weight, the active and inactive compounds have no significant difference, while between the two and the intermediate class, there are significant differences.A similar pattern also can be observed in the number of hydrogen bond acceptors.In contrast, there is no significant difference in the number of hydrogen bond donors between the inactive and the intermediate compounds, while they are significant between the inactive and active compounds, and between the intermediate and active compounds.

B. PERFORMANCE OF THE REGRESSION ALGORITHMS
Before feeding the data into the algorithms, the dataset was assessed for its modelability.This step is required to ensure the dataset could produce predictive QSAR models, identified by the Modelabiliy Index (MODI) that is greater than 0.65 [41].In this study, the MODI is calculated utilizing an R code that was previously used in several published works by other researchers [2], [3].The result shows that our curated dataset has a MODI of 0.768, indicating its modelability and the potential to yield predictive QSAR models.

R 2 AND ADJUSTED R 2
The QSAR modeling and evaluation part was executed with 42 algorithms, four dataset variances, and 100 repetitions, or in total, 33,600 experiments.However, due to various factors, the process only yielded 33,028 performance data.From the Exploratory Data Analysis (EDA) on the performance data, we found that much individual repetition resulted in invalid R 2 and/or Adjusted R 2 scores on the model testing part.Moreover, none of the performance parameters in the full dataset and the one without the intermediate bioactivity were found to be within the valid range.Fig. 4 shows the distributions of the valid R 2 and Adjusted R 2 .It is only the dataset with omitted low variance features that have the testing performance parameters within the valid range.Therefore, further discussions are only concerned with the QSAR modeling yielded by this particular dataset.Out of 42 algorithms, only 10 fall within this category, with the data frequencies shown in Fig. 5. Fig. 6 shows the values of the performance parameters from the model testing part.The shapes of the boxes correspond to the distributions of values.Since the Histogram Gradient Boosting Regressor (HGBR), Light Gradient Boosting Machine Regressor (LGBR), and Random Forest Regressor (RFR) are those that have more frequencies, therefore the boxes have larger spans, denoting the standard deviations.Therefore, further comparisons will be focused on these three algorithms.The Shapiro-Wilk tests of distribution normality in TABLE 4 show that the parameters produced by the LGBR algorithm are not normally distributed, hence further comparison must use a non-parametric method.TABLE 5 shows the results of the Kruskal-Wallis tests for both evaluation parameters between the algorithms.On either parameter, the difference is not significant.

COST FUNCTION
The Root Mean Squared Error (RMSE) was used as the cost function upon QSAR modeling and evaluation.It is calculated by taking the root of the squared average of the difference between each prediction and the actual value.Since the value reflects errors, which means the lower the RMSE, the better the model's prediction capability.Fig. 7 shows the RMSE boxplots of the algorithms from the model testing of the nonlow variance dataset.It can be seen that the top three frequencies tend to have lower RMSE compared to the other algorithms.The Shapiro-Wilk test in TABLE 6 shows that each of these top three frequency algorithms has normally distributed RMSE.Therefore, we continue comparing the RMSE using ANOVA.The result shown in TABLE 7 indicates that there is no significant difference in the RMSE of the HGBR, LGBR, and RFR algorithms.

TIME TAKEN
This parameter highlights the time needed for either training or testing.The lower the Time Taken means it took a shorter time for training and/or testing.Fig. 8 shows the distribution of time taken for training and testing the model produced by each algorithm.The Shapiro-Wilk test in TABLE 8 shows that for each algorithm in both phases, the durations are not normally distributed.In light of the distribution normality, the Kruskal-Wallis test is used to compare the time taken between the three algorithms.As shown in TABLE 9, there is at least one algorithm that has a significantly different duration in both training and testing.Using the Dunn test for post-hoc analysis shown in TABLE 10, we found the time taken between these three algorithms is significantly different.As reflected by the boxplots in Fig. 8, it is the LGBR that has the fastest time for both training and testing for this particular case.

IV. DISCUSSION
Conventional drug discovery is a tedious and expensive process.Fortunately, with the recent advances in computing technologies, in silico processes could be exercised for screening potential drug candidates.Classification is one of the supervised machine learning tasks that is gaining popularity for identifying potential bioactivity.In some studies, the bioactivity is based on the IC50 values that are grouped into two or three groups.Despite it is being a common approach, however, this practice reduces the information that can be learned by the model, as well as losing some statistical properties [34], [42], [43].
In this study, we applied regression algorithms to predict the IC50, instead of classification.We argue that this approach should give more information, and better comparability with results from other studies.The SARS-CoV-2 Replicase Polyprotein 1ab is taken as a target.Using curated bioactivity data from the ChEMBL database, we tested the approach with 42 regression algorithms and four dataset variances that repeated 100 times.Counting the training and testing phases, the experiment yields 33,028 data or only about 98.29% of the expected 33,600 data.By investigating the experiment logs,  the reduction was caused by some algorithms that failed to run properly due to glitches or incompatibilities with the data.Moreover, much of the resulting testing phase performance parameters, namely the R 2 and the Adjusted R 2 are falling out of the theoretically possible range.In the end, only some of the results from the no-low variance dataset can have meaningful model testing performance parameters.Algorithms-wise, only 10 out of 42 satisfy the requirement, and only three of them can be considered stable enough for inference, judging from the number of valid R 2 and Adjusted R 2 values.From these three algorithms, namely Histogram Gradient Boosting Regression (HGBR), Light Gradient Boosting Machine Regression (LGBR), and Random Forest Regression (RFR), they share similar qualities in terms of predicting capability (R 2 and Adjusted R 2 ) and the observed cost function (RMSE).It is only the Time Taken that has significant differences between the algorithms, where LGBR came up with the shortest durations, followed by HGBR and then the RFR.
Random Forest Classifier is a known Machine Learning method in drug discovery, as demonstrated in several studies [2]- [6].Despite the popularity of its counterpart, as far as this study is concerned, the performance of the Random Forest algorithm can be matched by HGBR and LGBR.If there is an emphasis on time, the LGBR should be highly considered, although further investigations must be carried out.On the contrary, the Support Vector Regression (SVR) did not perform as expected, despite it being reported to be proficient in QSAR modeling [7].
Our study has several implications for the use of Machine Learning algorithms in drug discovery.First, we have explored the use of regression algorithms where classification is the commonly applied technique in the field.We have also found that the application of regression algorithms in this particular field faces several computational as well as algorithmic challenges where most of the model performance parameters in the testing phase were invalid.However, the possibility of getting a higher range of information compared to the discretized approach in classification becomes a strong point for more investigation.Despite our experiment yielding HGBR, LGBR, and RFR as the most stable regression algorithms that might not be the case with other datasets, such as different targets or molecular descriptors.Moreover, compared to the accuracy of the previous study that targeted the SARS-CoV-2 3CLpro [3], which is about 70%, the algorithms we evaluated still performed poorly.Whether hyperparameter tuning can be exploited to increase performance, should be thoroughly explored.Therefore, further studies must be undertaken to answer these questions.

V. CONCLUSION
Drug discovery is a long and expensive process.Fortunately, due to the advancement in computational technology and cheminformatics, those properties could be reduced for good reasons.When to COVID-19 pandemic was starting to hit the world, there was an urgency for an accelerated development of novel as well as repurposed drugs.Machine Learning which  is a branch of computational science is extensively used in this part, with the classification task as the major approach.
In this study, we explored the use of regression algorithms to predict the compounds' inhibitory bioactivity toward the SARS-CoV-2 replicase polyprotein 1ab.By using the IC50 data from the ChEMBL database and PubChem descriptor for converting the compounds into numeric vectors, we experimented with the combinations of algorithms and dataset variances.Evaluating the results, we uncovered that the use of regression algorithms is challenging, for we got many invalid performance parameters from the model testing phase, where the R 2 and/or Adjusted R 2 values are out of the theoretical range.However, the use of regression algorithms in this particular field would yield more information than the discretized ones in the classification task.We found that for the particular dataset used in this study, the regression algorithms of Histogram Gradient Boosting, Light Gradient Boosting Machine, and Random Forest yielded the highest counts of valid results.
Regardless of the use of Random Forest Classification in some previous studies, from our experiments, its regression counterpart has no statistically significant performance differences compared to the Histogram Gradient Boosting Regressor and the Light Gradient Boosting Machine Regressor.Surprisingly, the Light Gradient Boosting Machine Regressor has a significantly lower duration for either training or testing, making it a promising algorithm to be explored for this particular case.
There should be more studies that investigate the use of regression algorithms in predicting inhibitory bioactivity.The three algorithms should even be explored from several different approaches.The performances of each algorithm could be improved by using the appropriate hyperparameter tuning.In this study, we only converted the dataset to a feature matrix with a single molecular fingerprint.As different fingerprint leads to a different form of feature matrix, it should lead to different performance.This could be addressed in future studies.Moreover, a different target enzyme or protein might have another set of algorithms that suits it.Hence, this study then could be repeated with a different target.

FIGURE 2 .
FIGURE 2. Compounds frequency in each bioactivity class.

FIGURE 3 .
FIGURE 3. (a) Chemical space analysis as a function of molecular weight (MW) and octanol-water partition (LogP); (b)-(e) Distributions of the Lipinski's descriptors for each bioactivity class.

FIGURE 8 .
FIGURE 8. Distributions of the training and testing durations of each algorithm.