Comparison of the Performance of Linear Discriminant Analysis and Binary Logistic Regression Applied to Risk Factors for Mortality in Ebola Virus Disease Patients

The aim of our study was to identify risk factors associated with mortality in patients with Ebola virus disease using binary logistic regression and linear discriminant analysis, to assess the predictive power of these two methods, and to compare the performance of the models in terms of coefficients and predictions. Our study was an observational analysis of data from a double-blind randomized controlled trial conducted in 2018 during the 10th Ebola outbreak in eastern DRC. The study included 363 patients divided into two treatment arms: 182 patients treated with MAB114 (Ebanga) and 181 patients treated with REGENERON (REGN-EB3). After an in-depth analysis of the data, the two statistical methods selected the same set of variables (risk factors). For binary logistic regression we obtained the following odds ratios (with confidence intervals): viral load 0.58 (0.5-0.67), creatinine 1.98 (1.58-2.49) and aspartate aminotransferase 0.99 (0.9-1). For linear discriminant analysis, the standardized coefficients were: viral load (0.88), creatinine (0.94) and aspartate aminotransferase (0.78). We also find almost the same results when the predictions are evaluated: logistic regression predicted a 36.5% mortality rate and linear discriminant analysis predicted a 38.8% mortality rate. With regard to the predictive power of the two models, we used the AUC (area under the curve) and obtained 0.935 for binary logistic regression and 0.932 for linear discriminant analysis. In line with the hypothesis under evaluation, the two methods give the same risk factors (viral load (Ctnp), creatinine and aspartate aminotransferase (AST)) with a predictive performance (AUC) of around 0.93.


I. INTRODUCTION
Various predictive statistical methods have been proposed to estimate risk factors associated with disease mortality. The aim of this study was to identify risk factors associated with mortality in Ebola virus disease patients using binary logistic regression and linear discriminant analysis, with data from a randomized trial conducted when the 10th Ebola epidemic broke out in eastern Democratic Republic of Congo in 2018. The performance of the models in terms of coefficients and predictions was also compared.
Logistic regression measures the relationship between the occurrence of an event (the qualitative response variable) and the factors that may affect it (explanatory variables). The selection of explanatory variables to be included in the logistic regression model is based on prior knowledge of the disease and on the statistical association between the variables and the event, as measured by odds ratios. The regression is based on maximum likelihood estimation.
Discriminant analysis is both a predictive technique (Linear Discriminant Analysis, LDA) and a descriptive technique (Discriminant Factor Analysis, DFA). It is designed to describe and predict the membership of an individual in a group (class), represented by a categorical target variable, from a collection of explanatory/descriptive variables that are mainly quantitative but can be qualitative with adjustments. Discriminant analysis is based on least squares estimation, which corresponds to linear estimation [4], [8], [13], [14].
This study aims to evaluate the performance of two predictive data analysis methods: Linear Discriminant Analysis and Binary Logistic Regression. The specific contributions of this study are to:
a. Study the performance of Linear Discriminant Analysis applied to risk factors associated with Ebola virus disease mortality;
b. Study the performance of Binary Logistic Regression applied to risk factors associated with Ebola virus disease mortality;
c. Compare the performance of Linear Discriminant Analysis and Binary Logistic Regression;
d. Show that the same results can be achieved with the two methods of predictive data analysis.

A. STUDY POPULATION
The study was conducted in the provinces of North Kivu, South Kivu and Ituri in the Democratic Republic of Congo (DRC) during the 10th Ebola outbreak in 2018. It was an observational study based on a double-blind randomized clinical trial. In August 2018, an Ebola hemorrhagic fever (EHF) outbreak began in these provinces. This was the 10th confirmed Ebola outbreak in the country since Ebola was first reported in Zaire in 1976.
After the end of the outbreak in West Africa, the World Health Organization (WHO) launched a series of discussions to advance research on Ebola. These included group work on how experimental therapies should be evaluated in the context of the next Ebola outbreak. The discussions led to a consensus that, where possible, the most promising experimental treatments should be evaluated in randomized controlled trials in the event of a new epidemic. This infrastructure facilitated the unity of the international community and the leadership of the Democratic Republic of Congo in developing and conducting the clinical trials.
The study compared two molecules, REGENERON (REGN-EB3) and MAB114 (Ebanga), as these molecules have been shown to be effective in the management of patients [4], [13], [14], [15], [16]. The inclusion criterion for the study was a positive Ebola test (positive PCR). The study was jointly approved by the University of Kinshasa ethics committee and the National Institute of Allergy and Infectious Diseases (NIAID) review board, and was overseen by an independent data and safety monitoring committee. Written informed consent was obtained from all patients or, for children, from their legal guardians, in accordance with local standards and requirements [16].

B. DATA COLLECTION AND ANALYSIS
The data used in this study come from a randomized, double-blind, controlled clinical trial conducted on a sample of 363 patients, including 182 patients treated with MAB114 (Ebanga) and 181 patients treated with REGENERON (REGN-EB3). The data were entered into REDCap before being transferred to IBM SPSS version 21.0 and R version 4.2.1 for analysis. The normality of the variables was assessed using the Kolmogorov test, and the variables were found to be normally distributed. The data were then classified by binary logistic regression and linear discriminant analysis, and we determined the risk factors and prediction probabilities for both methods.
The logistic model describes the distribution of $Y \mid X = x$ by a Bernoulli distribution with parameter $p(x) = P(Y = 1 \mid X = x)$, as defined in equation (1) [1]:
$$\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = x'\beta \qquad (1)$$
or $\operatorname{logit}(p(x)) = x'\beta$, where $x = (1, x_1, \dots, x_p)'$ and logit denotes the bijective and differentiable function from $]0,1[$ to $\mathbb{R}$ given by $\operatorname{logit}(u) = \log\big(u/(1-u)\big)$. The equality can equivalently be written as in equation (2) [1]:
$$p(x) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)} \qquad (2)$$
In a logistic model, we make two (2) choices to define the model:
1. The choice of a distribution for $Y \mid X = x$, here the Bernoulli distribution;
2. The choice of modelling $P(Y = 1 \mid X = x)$ by $\operatorname{logit}\big(P(Y = 1 \mid X = x)\big) = x'\beta$.
The function logit is bijective [1], [7], [10], [18], [26].
In the polytomous case, where $Y$ takes $K$ modalities and $p_j(x) = P(Y = j \mid X = x)$, the parameters $\beta_j$, $j = 1, \dots, K-1$ (with the convention $\beta_K = 0$), are estimated by maximum likelihood. For an observation $(x, y)$, we denote by $y_1, \dots, y_K$ a complete disjunctive coding of $y$, i.e. $y_j = 1$ if $y = j$, 0 otherwise. The likelihood is written [7] as in equation (3):
$$L(y, \beta) = p_1(x)^{y_1} \cdots p_K(x)^{y_K} \qquad (3)$$
The likelihood therefore corresponds to a multinomial distribution $\mathcal{M}(1, p_1(x), \dots, p_K(x))$. This is why this model is also called a "polytomous multinomial model". The maximum likelihood estimators are again obtained by setting to zero the partial derivatives of the sample likelihood with respect to the different parameters.
As in the dichotomous case, there are no explicit solutions for the estimators and numerical methods are used to calculate them. There is no real novelty compared to the binary case; the algorithm is simply more delicate to write because of the larger number of parameters. Odds ratios do not generally appear in the software output for the multinomial model: they must therefore be calculated by hand, taking care to account for the particular coding of the qualitative explanatory variables. We recall that for an individual $x$, the odds of an event $Y = j$ are equal to the ratio $P(Y = j \mid X = x) / P(Y \neq j \mid X = x)$. In the multinomial model, we define the odds of an event $Y = j$ against an event $Y = k$ by equation (4) [7]:
$$\operatorname{odds}(x, Y = j \text{ vs } Y = k) = \frac{P(Y = j \mid X = x)}{P(Y = k \mid X = x)} = \exp\big((\beta_j - \beta_k)'x\big) \qquad (4)$$
For two individuals $x_i$ and $x_{i'}$, the odds ratio is defined in equation (5) [7]:
$$\operatorname{OR}(x_i, x_{i'}, Y = j \text{ vs } Y = k) = \frac{P(Y = j \mid X = x_i)/P(Y = k \mid X = x_i)}{P(Y = j \mid X = x_{i'})/P(Y = k \mid X = x_{i'})} = \exp\big((\beta_j - \beta_k)'(x_i - x_{i'})\big) \qquad (5)$$
Thus, if the two individuals $x_i$ and $x_{i'}$ differ by only one unit for the variable $\ell$, we have:
$$\operatorname{OR}(x_i, x_{i'}, Y = j \text{ vs } Y = k) = \exp(\beta_{j\ell} - \beta_{k\ell})$$
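The maximum likelihood estimation and the odds ratios described above can be sketched numerically for the binary case. This is a minimal illustration, assuming a Newton-Raphson solver and simulated data; the function names `sigmoid` and `fit_logistic`, the simulated coefficients and the 1.96 multiplier for a 95% Wald interval are assumptions of this sketch, not the software procedure used in the study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson maximum-likelihood fit of logit(p(x)) = x'beta."""
    X1 = np.column_stack([np.ones(len(X)), X])        # add the intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X1 @ beta)                        # fitted probabilities, equation (2)
        score = X1.T @ (y - p)                        # gradient of the log-likelihood
        info = X1.T @ (X1 * (p * (1 - p))[:, None])   # information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the information matrix at the final estimate
    p = sigmoid(X1 @ beta)
    info = X1.T @ (X1 * (p * (1 - p))[:, None])
    return beta, np.linalg.inv(info)                  # estimates and their covariance

# Simulated data (hypothetical, not the trial data)
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
true_beta = np.array([-0.5, 1.0, -0.7])
y = (rng.random(n) < sigmoid(true_beta[0] + X @ true_beta[1:])).astype(float)

beta_hat, cov = fit_logistic(X, y)
se = np.sqrt(np.diag(cov))

# Odds ratio for a one-unit increase in each variable, with a Wald interval
odds_ratios = np.exp(beta_hat[1:])
ci_low = np.exp(beta_hat[1:] - 1.96 * se[1:])
ci_high = np.exp(beta_hat[1:] + 1.96 * se[1:])
for name, o, lo, hi in zip(["x1", "x2"], odds_ratios, ci_low, ci_high):
    print(f"{name}: OR = {o:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An odds ratio below 1 (as for `x2` here) indicates that an increase in the variable lowers the odds of the event, and an odds ratio above 1 indicates the opposite.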

D. SPECIFICATION AND INFERENCE TESTING
The various estimation methods presented above lead to asymptotically normal estimators when the number of observations tends to infinity. It is therefore easy to use these estimators to construct test procedures, some of which are asymptotically equivalent. We present here the main test procedures based on maximum likelihood estimation, which is the most frequently used method. The most frequent tests are listed below:
1) Wald test
2) Likelihood Ratio Test (LRT)
3) Score test or Lagrange Multiplier (LM) test
It should be remembered that these three tests are only asymptotically equivalent, so they can contradict each other on small samples. Moreover, since their distribution is only asymptotically valid, care should be taken when using them on small samples.
The LRT is also known to be the most powerful locally and should therefore be preferred a priori. We only consider here the case of a two-sided test on a coefficient or on a set of coefficients (in order to obtain confidence intervals for the ORs) [5], [7].
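To make the Wald and likelihood ratio procedures concrete, the sketch below tests a single coefficient of a binary logistic model on simulated data. The helper names `fit_logit` and `chi2_sf_1df`, the simulated effect sizes and the sample size are assumptions made for illustration; the p-value uses the identity P(chi-square with 1 df > s) = erfc(sqrt(s/2)), which holds only for one degree of freedom.

```python
import math
import numpy as np

def fit_logit(X, y, n_iter=50):
    """Newton-Raphson ML fit; X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1 - p))[:, None])
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, np.linalg.inv(info), loglik

def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Simulated data (hypothetical): x1 has a real effect, x2 has none
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * x1)))
y = (rng.random(n) < p_true).astype(float)

X_full = np.column_stack([np.ones(n), x1, x2])
X_reduced = np.column_stack([np.ones(n), x2])     # same model without x1

beta, cov, ll_full = fit_logit(X_full, y)
_, _, ll_reduced = fit_logit(X_reduced, y)

# Wald test: squared estimate divided by its estimated variance
wald_stat = beta[1] ** 2 / cov[1, 1]
# Likelihood ratio test: twice the gain in log-likelihood
lrt_stat = 2.0 * (ll_full - ll_reduced)

wald_p = chi2_sf_1df(wald_stat)
lrt_p = chi2_sf_1df(lrt_stat)
print(f"Wald = {wald_stat:.2f} (p = {wald_p:.3g}), LRT = {lrt_stat:.2f} (p = {lrt_p:.3g})")
```

On a large sample the two statistics are usually close; on small samples they can differ noticeably, which is the caution raised in the text.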

F. DISCRIMINANT ANALYSIS CAN BE A DESCRIPTIVE TECHNIQUE.
This is known as discriminant factor analysis (or descriptive discriminant analysis). The objective is to propose a new system of representation, made of latent variables formed from linear combinations of the predictor variables, which separates the groups of individuals as well as possible.
In this sense, it is close to factor analysis because it provides a graphical representation in a reduced space. More specifically, it is close to a principal component analysis calculated on the conditional centres of gravity of the point clouds with a particular metric. It is also known as canonical discriminant analysis, especially in Anglo-Saxon software.

G. DISCRIMINANT ANALYSIS CAN BE PREDICTIVE.
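In this case, the aim is to construct a classification function (assignment rule, etc.) which makes it possible to predict the group to which an individual belongs on the basis of the values taken by the predictive variables. We assume that, within each group $E_k$ ($k = 1, \dots, K$), $X$ follows a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$, and that $\pi_k$ is the prior probability of group $E_k$. If the covariance matrices are equal, $\Sigma_1 = \dots = \Sigma_K = \Sigma$ (homoscedasticity hypothesis or equicovariance hypothesis), the calculations are simplified. This assumption can be interpreted geometrically in terms of the shape and volume of the point clouds in the representation space: these clouds have the same shape (and volume). In this case, the Bayesian assignment rule is written as in equation (6) [5], [7]:
$$k^*(x) = \arg\max_k \left\{ \ln \pi_k - \tfrac{1}{2}\ln|\Sigma| - \tfrac{1}{2}(x - \mu_k)'\Sigma^{-1}(x - \mu_k) \right\} \qquad (6)$$
Indeed, by developing the quantity in equation (7) [5], [6]:
$$(x - \mu_k)'\Sigma^{-1}(x - \mu_k) \qquad (7)$$
we obtain equation (8) [5], [6]:
$$(x - \mu_k)'\Sigma^{-1}(x - \mu_k) = x'\Sigma^{-1}x - 2\mu_k'\Sigma^{-1}x + \mu_k'\Sigma^{-1}\mu_k \qquad (8)$$
Maximizing the quantity in (6) is therefore equivalent to maximizing the quantity in equation (9) [5], [6]:
$$\ln \pi_k + \mu_k'\Sigma^{-1}x - \tfrac{1}{2}\mu_k'\Sigma^{-1}\mu_k \qquad (9)$$
(since $x'\Sigma^{-1}x$ and $\ln|\Sigma|$ do not depend on $k$). The maximum likelihood estimators are given in equation (10) [5], [6]:
$$\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k}\sum_{i \in E_k} x_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in E_k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)' \qquad (10)$$
This gives the classification rule of linear discriminant analysis in equation (11) [6], [7]:
$$k^*(x) = \arg\max_k L_k(x), \qquad L_k(x) = \ln \hat{\pi}_k + \hat{\mu}_k'\hat{\Sigma}^{-1}x - \tfrac{1}{2}\hat{\mu}_k'\hat{\Sigma}^{-1}\hat{\mu}_k \qquad (11)$$
where $L_k$ is the linear discriminant function of the group $E_k$ (also called the linear classification function). Each linear discriminant function defines a score function, and a new observation is assigned to the group with the highest score.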
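The classification rule can be sketched as follows: estimate the group proportions, the group means and a pooled covariance matrix, compute one linear score per group, and assign each observation to the group with the highest score. The function names `fit_lda` and `lda_scores` and the simulated two-group data are assumptions of this sketch; the pooled covariance is divided by n, which is one common convention for the maximum likelihood estimate.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, group means and the pooled covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        d = X[y == k] - mu
        pooled += d.T @ d
    pooled /= n                                   # maximum likelihood (divide by n)
    return classes, priors, means, pooled

def lda_scores(X, priors, means, pooled):
    """Linear discriminant score of every observation for every group."""
    inv = np.linalg.inv(pooled)
    linear = X @ inv @ means.T                    # mu_k' Sigma^-1 x, shape (n, K)
    const = -0.5 * np.sum(means @ inv * means, axis=1) + np.log(priors)
    return linear + const

# Simulated data (hypothetical): two Gaussian groups with a common covariance
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

classes, priors, means, pooled = fit_lda(X, y)
scores = lda_scores(X, priors, means, pooled)
pred = classes[np.argmax(scores, axis=1)]         # group with the highest score
accuracy = np.mean(pred == y)
print(f"resubstitution accuracy: {accuracy:.3f}")
```

The accuracy printed here is computed on the data used to fit the rule, so it is optimistic; a held-out sample or cross-validation would give a fairer estimate.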

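H. TESTS AND SELECTION OF DISCRIMINANT VARIABLES
1) HOMOSCEDASTICITY AND BOX TEST
The hypothesis of equality of the matrices $\Sigma_k$ can be tested using the Box test. If the hypothesis $\Sigma_1 = \Sigma_2 = \dots = \Sigma_K$ is true, the quantity in equation (12) [5], [6]:
$$(1 - c)\left[(n - K)\ln|S| - \sum_{k=1}^{K}(n_k - 1)\ln|S_k|\right], \qquad c = \frac{2p^2 + 3p - 1}{6(p + 1)(K - 1)}\left(\sum_{k=1}^{K}\frac{1}{n_k - 1} - \frac{1}{n - K}\right) \qquad (12)$$
follows approximately a $\chi^2$ distribution with $p(p+1)(K-1)/2$ degrees of freedom, where $S_k$ is the unbiased covariance matrix of group $k$ (of size $n_k$) and $S = \frac{1}{n - K}\sum_{k}(n_k - 1)S_k$ is the pooled within-group covariance matrix.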
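The Box test of equal covariance matrices can be illustrated with a short computation. The sketch below uses the standard form of Box's M statistic with its chi-square correction factor; the function name `box_m` and the simulated data are assumptions, and the sketch expects at least two variables. Only the statistic and its degrees of freedom are returned, to be compared with a chi-square critical value.

```python
import numpy as np

def box_m(X, y):
    """Box's M statistic (chi-square approximation) and its degrees of freedom."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    pooled = np.zeros((p, p))
    log_dets, sizes = [], []
    for k in classes:
        Xk = X[y == k]
        nk = len(Xk)
        Sk = np.cov(Xk, rowvar=False)             # unbiased group covariance
        pooled += (nk - 1) * Sk
        log_dets.append(np.linalg.slogdet(Sk)[1])
        sizes.append(nk)
    pooled /= (n - K)                             # pooled within-group covariance
    sizes = np.array(sizes)
    m = (n - K) * np.linalg.slogdet(pooled)[1] - np.sum((sizes - 1) * np.array(log_dets))
    c = (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (K - 1)) * (
        np.sum(1.0 / (sizes - 1)) - 1.0 / (n - K))
    df = p * (p + 1) * (K - 1) // 2
    return (1 - c) * m, df

# Simulated data (hypothetical): same covariance, then group 1 inflated
rng = np.random.default_rng(0)
y = np.array([0] * 200 + [1] * 200)
X_equal = rng.normal(size=(400, 2))
X_unequal = X_equal.copy()
X_unequal[y == 1] *= 3.0                          # group 1 now has 9 times the variance

stat_eq, df = box_m(X_equal, y)
stat_uneq, _ = box_m(X_unequal, y)
print(f"df = {df}, equal covariances: {stat_eq:.2f}, unequal covariances: {stat_uneq:.2f}")
```

With 3 degrees of freedom the 5% critical value of the chi-square distribution is about 7.81, so a statistic far above it leads to rejecting the equality of the covariance matrices.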

2) WILKS' TESTS
Let the following hypotheses apply. Null hypothesis $H_0$: the conditional centres of gravity coincide, i.e. $Y$ and $X$ are independent ($\mu_1 = \mu_2 = \dots = \mu_K$). Alternative hypothesis $H_1$: at least one centre of gravity deviates significantly from the others. The test statistic is Wilks' lambda, given by equation (13) [5], [6]:
$$\Lambda = \frac{|W|}{|V|} \qquad (13)$$
Under $H_0$ it follows Wilks' distribution with parameters $(p, n - K, K - 1)$, where $|W|$ is the determinant of the within-group variance-covariance matrix and $|V|$ the determinant of the overall variance-covariance matrix. $H_0$ is rejected if the calculated $\Lambda$ is less than the tabulated $\Lambda$. This test can be seen as a multidimensional generalization of the one-factor analysis of variance (ANOVA); in this case we speak of MANOVA (multivariate analysis of variance).
Tables of Wilks' distribution are rarely implemented in existing statistical software. Therefore, if $n$ is large enough, we use the Bartlett approximation in equation (14) [6], [7]:
$$\chi^2 = -\left[n - 1 - \frac{p + K}{2}\right]\ln\Lambda \qquad (14)$$
which follows a $\chi^2$ distribution with $p(K - 1)$ degrees of freedom.
In the case where $K = 2$, we can use the Rao transformation, which follows a Fisher distribution with parameters $(p, n - p - 1)$. The formula for the test statistic then becomes [5], [6], [21]:
$$F = \frac{1 - \Lambda}{\Lambda} \cdot \frac{n - p - 1}{p}$$
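The three quantities of this subsection (Wilks' lambda, the Bartlett chi-square approximation and the Rao F transformation for two groups) can be computed directly from the within-group and total sums of squares and cross-products. This is a minimal sketch on simulated data; the function name `wilks_lambda` and the data are assumptions. The ratio of determinants is the same whether the matrices are divided by n or not, so raw cross-product matrices are used.

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' lambda, Bartlett chi-square (with df) and Rao F (two groups only)."""
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    centered = X - X.mean(axis=0)
    total = centered.T @ centered                 # total cross-product matrix
    within = np.zeros((p, p))
    for k in classes:
        d = X[y == k] - X[y == k].mean(axis=0)
        within += d.T @ d                         # within-group cross-product matrix
    lam = np.linalg.det(within) / np.linalg.det(total)
    bartlett = -(n - 1 - (p + K) / 2.0) * np.log(lam)
    df = p * (K - 1)
    rao_f = (1 - lam) / lam * (n - p - 1) / p if K == 2 else None
    return lam, bartlett, df, rao_f

# Simulated data (hypothetical): two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

lam, bartlett, df, rao_f = wilks_lambda(X, y)
print(f"Lambda = {lam:.3f}, Bartlett chi2 = {bartlett:.1f} (df = {df}), Rao F = {rao_f:.1f}")
```

A lambda close to 0 means the groups are well separated, while a lambda close to 1 means the group centres are nearly identical.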

B. MORTALITY BY STUDY ARM
Overall mortality in the study was almost 40%, distributed by study arm as follows: 43.4% in the MAB114 (Ebanga) group and 35.9% in the REGENERON group (TABLE 3).

TABLE 3 Mortality by study arm
The mean baseline nucleoprotein cycle threshold (Ctnp) in this study was 22.7 ± 5.6, with 50.4% unvaccinated patients versus 49.6% vaccinated patients. For malaria alone, 14.3% of patients tested positive. The mean baseline blood glucose was 109.4 ± 70.8 mg/deciliter, the mean aspartate aminotransferase was 439.1 ± 652.3 U/liter and the mean alanine aminotransferase was 321.2 ± 415.9 U/liter.
Regarding mortality and the various biochemical parameters, the Ctnp variable indicated that the deceased patients had high viremia (a lower Ct value corresponds to a higher viral load). Their Ctnp values were lower than 22: 19.39 ± 3.92 in patients receiving MAB114 (Ebanga) and 18.97 ± 3.33 in patients receiving REGENERON. Patients who died had near-normal blood glucose levels: 106.29 ± 63.22 in patients receiving MAB114 and 105.72 ± 93.65 in patients receiving REGENERON. For creatinine, potassium and sodium, the table shows values within the normal range in both groups of deceased patients. We note that the liver function markers (aspartate aminotransferase (AST) and alanine aminotransferase (ALT)) in the deceased patients were above normal.

C. MULTIVARIATE ANALYSIS
1) LOGISTIC REGRESSION
Our dependent variable is death, and the independent variables considered for the logistic regression are: age, viral load (ctnp), creatinine, glucose, potassium, sodium, aspartate aminotransferase (AST) and alanine aminotransferase (ALT).

2) COMPOSITE TESTS OF MODEL COEFFICIENTS AND MODEL SUMMARY
The statistical indicator for the overall degree of association of the variables in the model is the chi-square statistic. We can conclude that the 8 independent variables are globally associated with patient death (p-value < 0.001). The quality of the regression can be assessed by means of coefficients of determination: the Cox and Snell R-squared and the Nagelkerke R-squared. Our factors (age, viral load (ctnp), creatinine, glucose, potassium, sodium, ALT, AST) explained between 51% and 70% of the variation in patient deaths in our study. The model identifies three (3) factors related to patient death: viral load 0.58 (0.5-0.67), creatinine 1.98 (1.58-2.49) and aspartate aminotransferase 0.99 (0.9-1). We found that patients with high creatinine levels had approximately twice the odds of death of patients with lower creatinine levels.
Discriminant analysis serves to explain and predict the membership of individuals in groups (classes), represented by a categorical target variable, from a set of explanatory/descriptive variables that are mainly quantitative but may also be qualitative [21].
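The Cox and Snell and Nagelkerke coefficients mentioned in the model summary are both derived from the log-likelihoods of the intercept-only model and of the fitted model. The sketch below shows the two formulas; the function name `pseudo_r2` and the log-likelihood values are hypothetical and chosen only for illustration, they are not the values of the study.

```python
import math

def pseudo_r2(loglik_null, loglik_model, n):
    """Cox and Snell and Nagelkerke pseudo R-squared from two log-likelihoods."""
    # Cox and Snell: 1 - (L_null / L_model)^(2/n), written on the log scale
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    # Largest value Cox and Snell can reach for this sample (perfect fit)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    # Nagelkerke rescales Cox and Snell so that the maximum is 1
    nagelkerke = cox_snell / max_cox_snell
    return cox_snell, nagelkerke

# Hypothetical log-likelihoods for illustration only (not taken from the study)
cox_snell, nagelkerke = pseudo_r2(loglik_null=-240.0, loglik_model=-120.0, n=363)
print(f"Cox and Snell R2 = {cox_snell:.3f}, Nagelkerke R2 = {nagelkerke:.3f}")
```

Because the Cox and Snell coefficient cannot reach 1 for a binary outcome, the Nagelkerke value is always the larger of the two, which is why the pair is reported as a range.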

4) MODEL VALIDATION TEST (BOX TEST AND WILKS' LAMBDA)
To validate or reject a model in discriminant analysis, we use the Box test and Wilks' lambda. For the Box test, the null hypothesis is the equality of the variance-covariance matrices. In our case the probability is lower than the significance level (p-value < 0.01), so the null hypothesis is rejected: the variance-covariance matrices differ between the groups. Wilks' lambda is 0.48 (p-value < 0.01), and we conclude that our model is valid for the rest of the analysis (TABLE 4).

5) CANONICAL DISCRIMINANT FUNCTION
We note that 100% of the discriminating power of the eight (8) variables is carried by the first discriminant function. The relatively strong canonical correlation (72.1%) testifies to the high utility of the first discriminant function (TABLE 5).
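When there is a single discriminant function, as with two groups (death or survival), the canonical correlation r, the eigenvalue of the discriminant function and Wilks' lambda are tied together by lambda = 1 - r^2 = 1 / (1 + eigenvalue). The short check below applies this identity to the canonical correlation reported in the text; the variable names are assumptions of this sketch.

```python
canonical_corr = 0.721                       # canonical correlation reported in the text

# Eigenvalue of the single discriminant function implied by r
eigenvalue = canonical_corr**2 / (1 - canonical_corr**2)

# Wilks' lambda implied by r; identical to 1 / (1 + eigenvalue)
wilks = 1 - canonical_corr**2

print(f"eigenvalue = {eigenvalue:.3f}, Wilks' lambda = {wilks:.3f}")
```

The implied lambda rounds to 0.48, which agrees with the Wilks' lambda reported in the model validation test.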

6) COEFFICIENTS OF STANDARDIZED CANONICAL DISCRIMINANT FUNCTIONS
The linear discriminant analysis model yields three (3) factors associated with death in Ebola virus disease patients.

7) EVALUATE THE PREDICTION PERFORMANCE (PREDICTION PROBABILITY) OF TWO METHODS BY THE ROC CURVE.
The logistic regression method predicted mortality for 132 out of 363 subjects, or 36.5%, while the linear discriminant analysis predicted mortality for 141 out of 363 subjects, or 38.8%. Receiver Operating Characteristic (ROC) curve analysis was used to assess and compare the predictive accuracy of the models. In our case, the two models (binary logistic regression and linear discriminant analysis) provide almost the same information: the area under the curve (AUC) is 0.935 for the binary logistic regression and 0.932 for the discriminant analysis (FIGURE 1).
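The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen patient who died receives a higher predicted score than a randomly chosen survivor, ties counting for one half. The sketch below computes the AUC from this pairwise definition; the function name `auc` and the small score vectors are hypothetical examples, not study data.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0          # positive ranked above negative
            elif sp == sn:
                wins += 0.5          # ties count for one half
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of death and observed outcomes (1 = died)
perfect = auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
chance = auc([0.9, 0.2, 0.8, 0.3], [1, 1, 0, 0])
print(perfect, chance)
```

Applied to the predicted probabilities of two models on the same patients, this gives one AUC per model, which is the comparison made in the text.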

IV. DISCUSSIONS
To evaluate the performance of our study, we compared our results with those of other authors, including the work by Rani, D. et al. [21]. In our case, the study investigated the risk factors associated with mortality of patients with Ebola virus disease using two methods of analysis (logistic regression and linear discriminant analysis). We found the same risk factors influencing patient mortality with both methods, i.e. viral load, creatinine and aspartate aminotransferase. We also evaluated the predictive accuracy of the two models (binary logistic regression and linear discriminant analysis) with the ROC curve: both methods provide almost the same information, with an area under the curve (AUC) of 0.93 for the binary logistic regression and 0.93 for the discriminant analysis. The weakness of this study is that it only takes into account deterministic data to assess the risk factors of patients with Ebola virus disease, without considering fuzzy data.

V. CONCLUSION
This study focuses on the identification of risk factors associated with death in Ebola patients using binary logistic regression (LR) and linear discriminant analysis (LDA) in an observational study based on a randomized controlled trial. These two methods are commonly used to determine the risk factors associated with a disease and, in some cases, to estimate the probability of disease-related mortality based on one or more factors. The literature reports that some researchers have compared the performance of these two methods in terms of model coefficients and obtained the same results, i.e. the predictors (model variables) found by logistic regression are the same as those found by discriminant analysis [8], [17], [21], [22]. The main objective of our study was to quantify the factors (variables) influencing patient mortality and to predict the probability of death using two predictive analysis methods (logistic regression and linear discriminant analysis), in order to evaluate the results produced by these two models. Note that the coefficients of the logistic and discriminant models give us the same information, i.e. the same risk factors. The factors provided by logistic regression are: viral load 0.58 (0.5-0.67), creatinine 1.98 (1.58-2.49) and aspartate aminotransferase 0.99 (0.9-1). For linear discriminant analysis we have viral load (0.88), creatinine (0.94) and aspartate aminotransferase (0.78) [25]. We see the same results even when different prediction probabilities are evaluated. Logistic regression predicted death for 132 of 363 subjects (36.5%), while linear discriminant analysis predicted death for 141 of 363 subjects (38.8%).
Equation (5) [7]:
$$\operatorname{OR}(x_i, x_{i'}, Y = j \text{ vs } Y = k) = \frac{P(Y = j \mid X = x_i)/P(Y = k \mid X = x_i)}{P(Y = j \mid X = x_{i'})/P(Y = k \mid X = x_{i'})} = \exp\big((\beta_j - \beta_k)'(x_i - x_{i'})\big)$$
C. LOGISTIC REGRESSION
Logistic regression, developed in 1944 by Joseph Berkson (American physicist, physician and statistician, 1899-1982), allows the discrimination of a binary or polytomous response variable $Y$ ($K \geq 2$ classes) from a matrix of $p$ explanatory variables $X = (X_1, \dots, X_p)$. The data can be a mixture of two different formats: continuous and qualitative. The strength of logistic regression lies in the form of the link function used (the logit or probit), which gives a sigmoidal form of modelling, including the notion of slope influenced by the frequency of observations, in the form of weights by sector, when we move from one sector to another according to the class described by the response $Y$ [1], [7].
Discriminant analysis can be considered as an extension of the regression problem to the case where the variable to be explained is qualitative. We have $n$ individuals (or observations) described by $p$ variables and divided into $K$ classes (groups) given by the qualitative variable $Y$. The $K$ classes are known a priori. The qualitative variable $Y$ therefore has $K$ modalities (it is the modalities of $Y$ which define the classes). We denote by $x_i$ a row of $X$ describing the $i$-th individual and by $x^j$ a column of $X$ describing the $j$-th variable; by $E_k$ the group of individuals in the sample who have the modality $k$; and by $n_k = \operatorname{card}(E_k)$ the number of individuals who have the modality $k$. The sets $\{x_i \mid i = 1, \dots, n\} \subset \mathbb{R}^p$ and $\{x^j \mid j = 1, \dots, p\} \subset \mathbb{R}^n$ denote the clouds of individuals and variables respectively. Taking into account its $K$ modalities, the variable $Y$ defines a partition of the set of individuals into $K$ subsets $E_1, E_2, \dots, E_K$, with the individual $i$ belonging to $E_k$ if it is the $k$-th modality of the qualitative variable which is realized. In our case, we present discriminant analysis as a predictive technique (linear discriminant analysis).