
**Logistic regression** is a statistical method for analyzing a dataset in which there are one or more
independent variables that determine an outcome. **Logistic regression** is the appropriate
regression analysis to conduct when the dependent variable is dichotomous (binary). Like all
regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to
describe data and to explain the relationship between one dependent binary variable and one or
more nominal, ordinal, interval or ratio-level independent variables.

**In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains
data coded as 1 (TRUE, success, Infectious, etc.) or 0 (FALSE, failure, non-infectious, etc.).**

The goal of logistic regression is to find the best fitting (yet biologically reasonable) model to
describe the relationship between the dichotomous characteristic of interest (dependent variable
= response or outcome variable) and a set of independent (predictor or explanatory) variables.
Logistic regression generates the coefficients (and their standard errors and significance levels) of a
formula to predict a *logit transformation* of the probability of presence of the characteristic of
interest:

$logit(p)=b_0+b_1X_1+b_2X_2+b_3X_3+...+b_kX_k$

where p is the probability of presence of the characteristic of interest. The logit transformation is defined as the logged odds:

$odds=\frac{p}{1-p}=\frac{\text{probability of presence of characteristic}}{\text{probability of absence of characteristic}}$

and

$logit(p)=ln(\frac{p}{1-p})$
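
As a quick numeric illustration of these two definitions, the sketch below computes the odds as the probability of presence divided by the probability of absence, and the logit as its natural log. The function names `odds` and `logit` are chosen here for illustration and do not come from any particular library.

```python
import math

def odds(p):
    """Odds of presence: P(presence) divided by P(absence)."""
    if not 0.0 < p < 1.0:
        # The logit is undefined at exactly 0 or 1, so both ends are rejected.
        raise ValueError("p must lie strictly between 0 and 1")
    return p / (1.0 - p)

def logit(p):
    """Logged odds: the natural log of odds(p)."""
    return math.log(odds(p))

for p in (0.1, 0.5, 0.8):
    print(f"p={p:.1f}  odds={odds(p):.3f}  logit={logit(p):+.3f}")
```

A probability of 0.5 corresponds to even odds and a logit of zero; probabilities below 0.5 give negative logits and probabilities above 0.5 give positive ones.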

Rather than choosing parameters that minimize the sum of squared errors (like in ordinary regression), estimation in logistic regression chooses parameters that maximize the likelihood of observing the sample values.
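
To make the idea of maximizing the likelihood concrete, here is a minimal sketch for a single predictor. It uses plain gradient ascent on the log-likelihood, which is a simplification chosen for readability; statistical packages typically use faster Newton-type iterations. The data set (hours studied versus pass/fail) and the step size are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log of the probability of observing ys given the coefficients."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(xs, ys, rate=0.1, steps=5000):
    """Gradient ascent on the log-likelihood (illustrative, not production code)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1

# Made-up data: the two classes overlap, so a finite maximum exists.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit(hours, passed)
print("log-likelihood at zero coefficients:", log_likelihood(0.0, 0.0, hours, passed))
print("log-likelihood after fitting:       ", log_likelihood(b0, b1, hours, passed))
```

The fitted coefficients give a higher log-likelihood than the starting values, which is exactly the criterion described above.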

Logistic regression can work with continuous data (like age and weight) and also with discrete data (like astrological sign and genotype). Logistic regression’s ability to provide probabilities and classify new samples using continuous and discrete measurements makes it a popular machine learning method.

**Types of questions that a binary logistic regression can examine.**

How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day?

Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?

**Binary logistic regression major assumptions:**

- The dependent variable should be dichotomous in nature (e.g., present vs. absent).
- There should be no outliers in the data, which can be assessed by converting the continuous predictors to standardized scores, and removing values below -3.29 or greater than 3.29.
- There should be no high correlations (multicollinearity) among the predictors. This can be assessed by a correlation matrix among the predictors. Tabachnick and Fidell (2013) suggest that as long as correlation coefficients among independent variables are less than 0.90, the assumption is met.
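
The outlier and multicollinearity checks in this list can be sketched with the standard library alone. The helper names (`z_scores`, `flag_outliers`, `pearson_r`) and the data are hypothetical; the cutoffs 3.29 and 0.90 are the ones quoted above.

```python
import math
import statistics

def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def flag_outliers(values, cutoff=3.29):
    """Return the values whose standardized score lies beyond +/- cutoff."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > cutoff]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two predictors."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up body weights with one implausible entry at the end.
weights = list(range(60, 98, 2)) + [250]
print("outliers:", flag_outliers(weights))

# Made-up predictors: the second is an exact linear function of the first.
calories = [1800, 2000, 2200, 2400, 2600]
fat = [60, 70, 80, 90, 100]
r = pearson_r(calories, fat)
print("r =", round(r, 3), "| too highly correlated:", abs(r) >= 0.90)
```

With very small samples a standardized score cannot reach 3.29 at all, so this screen is only meaningful once there are more than a dozen or so observations.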

The **logistic regression** technique involves a dependent variable that can be represented by
binary values (0 or 1, true or false, yes or no), meaning that the outcome can take only one of
two forms. For example, it can be used when we need to find the probability that an event
succeeds or fails. Here, the same formula as in linear regression is used, with the addition of a
sigmoid function, so that the predicted value P ranges from 0 to 1.

Logistic regression equation:

Linear regression: $Y=b_0+b_1\times X_1+b_2\times X_2+...+b_k\times X_k$

Sigmoid function: $P=\frac{1}{1+e^{-Y}}$

Substituting Y into the sigmoid function and solving for the linear part, we get the following result.

$ln(\frac{P}{1-P})=b_0+b_1\times X_1+b_2\times X_2+...+b_k\times X_k$
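
For readers who want the intermediate algebra, the step from the sigmoid function to the log-odds form can be written out as follows, using only the symbols P and Y defined above.

$1-P=1-\frac{1}{1+e^{-Y}}=\frac{e^{-Y}}{1+e^{-Y}}$

$\frac{P}{1-P}=\frac{1}{e^{-Y}}=e^{Y}$

$ln(\frac{P}{1-P})=Y=b_0+b_1\times X_1+b_2\times X_2+...+b_k\times X_k$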

[Figure: graph of the logistic regression model]

Logistic functions are used in logistic regression to identify how the probability P of an event is affected by one or more independent variables.
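
A small sketch may help show how P responds to a predictor. The coefficients below are invented purely for illustration (they are not estimates from any real study), and `predict_probability` is a hypothetical helper name.

```python
import math

def predict_probability(intercept, coefficients, values):
    """P = 1 / (1 + e^(-Y)), where Y is the linear combination of the predictors."""
    y = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical coefficients, made up for this example:
# intercept, pounds overweight, packs of cigarettes smoked per day.
b0 = -4.0
b = [0.02, 0.9]

for packs in (0, 1, 2):
    p = predict_probability(b0, b, [20, packs])
    print(f"packs/day={packs}: P={p:.3f}")

# Each additional pack multiplies the odds by e^0.9, whatever the starting point.
print(f"odds ratio per pack: {math.exp(b[1]):.2f}")
```

The change in P per unit of a predictor is not constant, but the change in the odds is: every one-unit increase multiplies the odds by the exponential of that predictor's coefficient.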

**Overfitting.** When selecting the model for the logistic regression analysis, another important
consideration is the model fit. Adding independent variables to a logistic regression model will
always increase the amount of variance explained in the log odds (typically expressed as R²).
However, adding more and more variables to the model can result in overfitting, which reduces
the generalizability of the model beyond the data on which the model is fit.

**Reporting the R².** Numerous pseudo-R² values have been developed for binary logistic
regression. These should be interpreted with extreme caution as they have many computational
issues which cause them to be artificially high or low. A better approach is to present any of the
goodness of fit tests available; Hosmer-Lemeshow is a commonly used measure of goodness of
fit based on the Chi-square test.
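
As a rough sketch of how the Hosmer-Lemeshow statistic is formed, the code below sorts cases by predicted probability, splits them into groups, and compares observed with expected event counts in each group. The function name and the equal-size grouping rule are assumptions made for this example; statistical packages may group ties differently. The result is usually compared with a chi-square distribution with (groups - 2) degrees of freedom, and the p-value lookup is left out because the standard library has no chi-square distribution.

```python
def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return (statistic, degrees_of_freedom) for predicted probabilities and 0/1 outcomes."""
    pairs = sorted(zip(probs, outcomes))  # order cases by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        # Undefined if every prediction in a group is exactly 0 or exactly 1.
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    return stat, groups - 2

# Tiny made-up example with two groups, just to show the arithmetic.
stat, df = hosmer_lemeshow([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1], groups=2)
print(round(stat, 4))
```

A small statistic relative to the chi-square reference means the observed and expected counts agree, that is, no evidence of poor fit.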

**Assumptions of Logistic Regression**

Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.

First, logistic regression does not require a linear relationship between the dependent and independent variables. Second, the error terms (residuals) do not need to be normally distributed. Third, homoscedasticity is not required. Finally, the dependent variable in logistic regression is not measured on an interval or ratio scale.

However, some other assumptions still apply.

First, binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal.

Second, logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.

Third, logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.

Fourth, logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds.

Finally, logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).
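
The rule of thumb above is simple enough to wrap in a helper. This is only a sketch of that guideline, with a hypothetical function name; `Fraction` is used so that rates such as 0.10 are handled exactly rather than as binary floating-point values.

```python
import math
from fractions import Fraction

def minimum_sample_size(n_predictors, least_frequent_rate, cases_per_predictor=10):
    """Smallest n with cases_per_predictor rare-outcome cases per independent variable."""
    rate = Fraction(str(least_frequent_rate))
    if not 0 < rate <= 1:
        raise ValueError("least_frequent_rate must be in (0, 1]")
    return math.ceil(cases_per_predictor * n_predictors / rate)

# The example from the text: 5 independent variables, least frequent outcome at .10.
print(minimum_sample_size(5, 0.10))
```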