5.1 Assumptions for logistic regression:

  • The response variable \(Y\) is a binomial random variable with a single trial and success probability \(\pi\). Thus, \(Y=1\) corresponds to “success” and occurs with probability \(\pi\), and \(Y=0\) corresponds to “failure” and occurs with probability \(1-\pi\).
  • The predictor or explanatory variables \(x=\left(x_{1}, x_{2}, \ldots, x_{k}\right)\) are fixed (not random) and can be discrete, continuous, or a combination of both. As with classical regression, two or more of these may be indicator variables modeling the nominal categories of a single predictor, and others may represent interactions between two or more explanatory variables.
  • Together, the data for the \(i\)th individual are collected in the vector \(\left(x_{1 i}, \ldots, x_{k i}, Y_{i}\right)\), for \(i=1, \ldots, n\). These observations are assumed independent by the sampling mechanism. This also allows us to combine or group the data, which we do below, by summing over trials for which \(\pi\) is constant. In this section of the notes, we focus on a single explanatory variable \(x\).
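The grouping step mentioned above can be sketched in a few lines. This is a minimal illustration with hypothetical data: since \(\pi\) is assumed constant at each value of \(x\), the binary trials sharing an \(x\) value can be collapsed into a (successes, trials) pair.

```python
from collections import defaultdict

# Hypothetical binary observations (x_i, y_i); pi is assumed constant
# within each x level, so trials at the same x can be pooled.
data = [(1, 0), (1, 1), (1, 0), (2, 1), (2, 1), (3, 1), (3, 0), (3, 1)]

grouped = defaultdict(lambda: [0, 0])  # x -> [successes, trials]
for x, y in data:
    grouped[x][0] += y   # count successes (y = 1)
    grouped[x][1] += 1   # count trials

# grouped now holds the binomial summary for each x level,
# e.g. x = 1 contributes 1 success out of 3 trials.
```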

The model is expressed as \[ \log \left(\frac{\pi_{i}}{1-\pi_{i}}\right)=\beta_{0}+\beta_{1} x_{i} \] or, by solving for \(\pi_{i}\), we have the equivalent expression \[ \pi_{i}=\frac{\exp \left(\beta_{0}+\beta_{1} x_{i}\right)}{1+\exp \left(\beta_{0}+\beta_{1} x_{i}\right)} \] To estimate the parameters, we substitute this expression for \(\pi_{i}\) into the joint pdf of \(Y_{1}, \ldots, Y_{n}\), \[ \prod_{i=1}^{n} \pi_{i}^{y_{i}}\left(1-\pi_{i}\right)^{1-y_{i}} \] which gives the likelihood function \(L\left(\beta_{0}, \beta_{1}\right)\) of the regression parameters. Maximizing this likelihood over all possible \(\beta_{0}\) and \(\beta_{1}\) yields the maximum likelihood estimates (MLEs) \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\). Extending this to include additional explanatory variables is straightforward.
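The maximization just described can be carried out numerically. The sketch below (with hypothetical data) minimizes the negative log of the likelihood \(\prod_i \pi_i^{y_i}(1-\pi_i)^{1-y_i}\) over \(\beta_0, \beta_1\), using the identity \(\log \pi_i = \eta_i - \log(1+e^{\eta_i})\) with \(\eta_i = \beta_0 + \beta_1 x_i\):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: a single explanatory variable x and binary response y.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   0,   1,   1,   1])

def neg_log_likelihood(beta):
    """Negative log-likelihood of the binary logistic model.

    With eta_i = b0 + b1*x_i, the log-likelihood is
    sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ].
    """
    b0, b1 = beta
    eta = b0 + b1 * x
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

# Maximize the likelihood = minimize its negative log.
fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="BFGS")
b0_hat, b1_hat = fit.x  # the MLEs beta0-hat and beta1-hat
```

In practice one would use a fitted routine (e.g. a GLM implementation), but the sketch makes explicit that the MLEs are simply the maximizers of \(L(\beta_0, \beta_1)\).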

Binary Logistic Regression

Binary logistic regression models how the odds of “success” for a binary response variable \(Y\) depend on a set of explanatory variables: \[ \operatorname{logit}\left(\pi_{i}\right)=\log \left(\frac{\pi_{i}}{1-\pi_{i}}\right)=\beta_{0}+\beta_{1} x_{i} \]

  • Random component: the distribution of the response variable is assumed to be binomial with a single trial and success probability \(E(Y)=\pi\).
  • Systematic component: \(x\) is the explanatory variable (continuous or discrete), and the model is linear in the parameters. As with the above example, this can be extended to multiple variables or to non-linear transformations of them.
  • Link function: the log-odds or logit link, \(\eta=g\left(\pi_{i}\right)=\log \left(\frac{\pi_{i}}{1-\pi_{i}}\right)\), is used.