Logistic Regression - ANSWER-Commonly used for modeling binary response data. The response variable is a binary variable and thus not normally distributed.
In logistic regression, we model the probability of a success, not the response variable. In this model, we do not have an error term
g-function - ANSWER-We link the probability of success to the predicting variables using the g link function. The g function is the S-shaped function that models the probability of success with respect to the predicting variables
The link function g is the log of the ratio of p over one minus p, where p again is the probability of success
Logit function (log odds function) of the probability of success is a linear model in the predicting variables
The probability of success is equal to the ratio between the exponential of the linear combination of the predicting variables over 1 plus this same exponential
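In symbols, combining the two statements above (with predicting variables x1, ..., xp):
g(p) = ln(p / (1 - p)) = β0 + β1x1 + ... + βpxp
p = exp(β0 + β1x1 + ... + βpxp) / (1 + exp(β0 + β1x1 + ... + βpxp))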
Odds of a success - ANSWER-This is the exponential of the Logit function
Logistic Regression Assumptions - ANSWER-Linearity: The relationship between g of the probability of success and the predicting variables is linear.
Independence: The response binary variables are independently observed
Logit: The logistic regression model assumes that the link function g is a logit function
Linearity Assumption - ANSWER-The Logit transformation of the probability of success is a linear combination of the predicting variables. The relationship may not be linear, however, and transformation may improve the fit
The linearity assumption can be evaluated by plotting the logit of the success rate versus the predicting variables.
If there is curvature or some other non-linear pattern, it may indicate that a lack of fit is due to non-linearity with respect to some of the predicting variables
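A minimal sketch of this check in Python (assuming pandas/matplotlib are available; the helper name and the binning into 10 quantile groups are illustrative choices, not from the source):

```python
# Check linearity: plot the empirical logit of the success rate vs a predictor.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def empirical_logit_plot(x, y, bins=10):   # hypothetical helper
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)
    n = grouped["y"].size()
    # add 0.5 to the counts to avoid log(0) in bins where all responses are 0 or 1
    p = (grouped["y"].sum() + 0.5) / (n + 1.0)
    plt.scatter(grouped["x"].mean(), np.log(p / (1 - p)))
    plt.xlabel("predicting variable")
    plt.ylabel("logit of success rate")
    plt.show()
```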
Logistic Regression Coefficient - ANSWER-We interpret the regression coefficient beta as the log of the odds ratio for an increase of one unit in the predicting variable
We do not interpret beta with respect to the response variable but with respect to the odds of success
The estimators for the regression coefficients in logistic regression are approximately unbiased, and thus the mean of the approximate normal distribution is beta. The variance of the estimator does not have a closed-form expression
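For example, if the estimated coefficient is β̂j = 0.5, the odds ratio for a one-unit increase in xj is e^0.5 ≈ 1.65, i.e. the odds of success are multiplied by about 1.65.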
Model parameters - ANSWER-The model parameters are the regression coefficients.
There is no additional parameter to model the variance since there's no error term.
For P predictors, we have P + 1 regression coefficients for a model with intercept (beta 0).
We estimate the model parameters using the maximum likelihood estimation approach
Response variable - ANSWER-The response data are Bernoulli, i.e. binomial with one trial, with some probability of success
MLE - ANSWER-The resulting log-likelihood function to be maximized is very complicated and non-linear in the regression coefficients beta 0, beta 1, ..., beta p
MLE has good statistical properties under the assumption of a large sample size i.e. large N
For large N, the sampling distribution of MLEs can be approximated by a normal distribution
The least squares estimation for the standard regression model is equivalent to MLE under the assumption of normality.
MLE is the most applied estimation approach
Parameter estimation - ANSWER-Maximizing the log likelihood function with respect to beta0, beta1 etc in closed (exact) form expression is not possible because the log likelihood function is a non-linear function in the model parameters i.e. we cannot derive the estimated regression coefficients in an exact form
Use a numerical algorithm to estimate the betas (maximize the log likelihood function). The estimated parameters and their standard errors are approximate estimates
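A minimal sketch of this fit (assuming the statsmodels library and simulated data; the true coefficients 0.5 and 1.2 are made up for illustration):

```python
# Fit a logistic regression by numerical maximum likelihood (Newton-type).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))  # simulated binary data

X = sm.add_constant(x)          # adds the intercept column (beta 0)
res = sm.Logit(y, X).fit()      # iterative maximization of the log-likelihood
print(res.params)               # approximate estimates of beta 0 and beta 1
print(res.bse)                  # their approximate standard errors
```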
Binomial Data - ANSWER-This is binary data with repetitions
Marginal Relationship - ANSWER-Capturing the association of a predicting variable to the response variable without consideration of other factors
Conditional Relationship - ANSWER-Capturing the association of a predicting variable to the response variable conditional on other predicting variables in the model
Simpson's paradox - ANSWER-This is when the addition of a predicting variable reverses the sign of the coefficient of an existing predictor
It refers to a reversal of an association when looking at a marginal relationship versus a partial or conditional one. This is a situation where the marginal relationship yields the wrong sign
This happens when the two predicting variables are correlated
Normal Distribution - ANSWER-The approximate normal sampling distribution of the estimated coefficients relies on a large sample of data. Using this approximate normal distribution we can further derive confidence intervals.
Since the distribution is normal, the confidence interval is the z-interval
**Applies for Logistic & Poisson Regression
Hypothesis Testing (coefficient == 0) - ANSWER-To perform hypothesis testing, we can use the approximate normal sampling distribution.
The resulting hypothesis test is also called the Wald test since it relies on the large sample normal approximation of MLEs
To test whether the coefficient betaj = 0 or not, we can use the z-value
**Applies for Logistic & Poisson Regression
Wald Test (Z-test) - ANSWER-The z-test value is the estimated coefficient minus 0 (the null value), divided by its standard error
We reject the null hypothesis that the regression coefficient is 0 if the z-value is larger in absolute value than the z critical point, i.e. the 1 - alpha/2 quantile of the standard normal distribution.
We interpret that the coefficient is statistically significant
**Applies for Logistic & Poisson Regression
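In symbols (from the definition above):
z-value = (β̂j - 0) / SE(β̂j); reject H0: βj = 0 if |z-value| > z(1 - α/2)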
Hypothesis Testing (coefficient == constant) - ANSWER-To test whether the regression coefficient is equal to some constant b, the z-value changes.
We subtract b from the estimated coefficient in the numerator
We decide to reject/accept using the P-value
The P-value is 2 times the probability that a standard normal variable exceeds the absolute value of the z-value
P-value = 2P(Z > |z-value|)
**Applies for Logistic & Poisson Regression
Hypothesis testing (statistical significance: +/-) - ANSWER-Here, the z-value is the same but the P-value will change
Positive:
P-value = P(Z > z-value)
Negative:
P-value = P(Z < z-value)
**Applies for Logistic & Poisson Regression
Statistical Inference - ANSWER-Logistic Regression: Normal distribution. The statistical inference based on the normal distribution applies only under large sample data. If the sample size n is small, the statistical inference is not reliable, i.e. we should warn about the lack of reliability of the results
Standard Regression: t-distribution. The statistical inference relies on the t-distribution, which applies under both small and large samples
**Applies for Logistic & Poisson Regression
Type I Error - ANSWER-This happens if the sample size n is small. The hypothesis testing procedure will have a probability of Type I error larger than the significance level (i.e. more Type I errors than expected)
**Applies for Logistic & Poisson Regression
Deviance - ANSWER-This is twice the difference between the log-likelihood of the full model and the log-likelihood of the reduced model
For large sample size data, the distribution (assuming the null hypothesis is true), is a chi square distribution with Q degrees of freedom
Q = number of Z predicting variables (the additional controlling variables), i.e. the number of regression coefficients discarded from the full model to get the reduced model
The P-value of the test is computed as the right tail of the chi-square distribution with Q degrees of freedom, evaluated at the test value (the deviance)
**This test is NOT a goodness of fit test. It simply compares two models and decides whether the larger model is statistically significantly better than the reduced model.
Coefficient Test (Deviance) - ANSWER-The hypothesis testing procedure is testing the null hypothesis that all alpha coefficients are zero, versus the alternative that at least one alpha coefficient is not zero
For the testing procedure for subsets of coefficients, we compare the likelihood of a reduced model versus a full model.
This test provides inferences on the predictive power of the model. Predictive power means that the predicting variables predict the data even if one or more of the assumptions do not hold
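In symbols (a sketch of the test described in the last two cards, with l denoting the maximized log-likelihood):
D = 2 * (l_full - l_reduced), which under the null hypothesis has a chi-square distribution with Q degrees of freedom
P-value = P(χ²_Q > D)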
Overall Regression - ANSWER-Standard Regression: We use the F test to test for the overall regression
Logistic Regression: We use the difference between the deviance of the model under the null hypothesis (called the null deviance) and the deviance of the full model (called the residual deviance), i.e. the difference between the null deviance and the residual deviance
Overall Regression (Logistic) - ANSWER-Under the null hypothesis, the test statistic has a chi-squared distribution with p degrees of freedom, where p is the number of predicting variables.
We reject the null hypothesis when the P-value is small, indicating that the overall regression has explanatory power.
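A minimal sketch of the overall regression test (assuming statsmodels and scipy; the simulated data are illustrative):

```python
# Overall regression test: null deviance minus residual deviance vs chi-square(p).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

X = sm.add_constant(x)
res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
test_stat = res.null_deviance - res.deviance   # null deviance - residual deviance
p = X.shape[1] - 1                             # number of predicting variables
print("p-value:", stats.chi2.sf(test_stat, df=p))
```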
Data w/ replications vs Data w/o replications - ANSWER-Data with replications:
We can observe binary data for repeated trials. That is a binomial distribution with more than one trial or ni greater than 1
Data without replications:
For each unique set of the observed predicting variables, we can observe binary data with no repeated trials. That is a binomial distribution with one trial where ni = 1
Logistic Regression with replications - ANSWER-Residuals: We can only define residuals for binary data with replications
Goodness of Fit: We perform goodness of fit only for logistic regression with replications under the assumption that Yi is binomial with ni greater than 1
Pearson Residuals - ANSWER-This is the standardized difference between the ith observed response and estimated expected response, which is ni times the probability of success
We need to standardize the difference between observed and expected response, as the responses have different variances
Pearson residuals have an approximately standard normal distribution
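In symbols (using the binomial variance ni pi (1 - pi), a standard fact not spelled out above):
ri = (yi - ni p̂i) / √(ni p̂i (1 - p̂i))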
Deviance Residuals - ANSWER-These are the signed square roots of each observation's contribution to the deviance, which compares the log-likelihood of the saturated model (where the estimated expected response is taken to be the observed response) with that of the fitted model
Deviance residuals have an approximately standard normal distribution if the model is a good fit (i.e. model assumptions hold)
Goodness of Fit - ANSWER-We can use the Pearson or Deviance residuals to evaluate whether they are normally distributed. If they're normally distributed, we conclude that the model is a good fit
If the model is not a good fit, it means the linearity assumption may not hold
Goodness of Fit Test - ANSWER-The null hypothesis is that the model fits well. The alternative is that the model does not fit well
The test statistic for the goodness of fit test is the sum of the squared deviance residuals, which has a chi-square distribution with n - p - 1 degrees of freedom
If the p-value is small, we reject the null hypothesis of good fit, and thus we conclude that the model is not a good fit. We want LARGE p-values; a large p-value indicates that the model may be a good fit
For goodness of fit test, we compare the likelihoods of the saturated model versus the fitted model.
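A minimal sketch of the test (assuming statsmodels and scipy, and binomial data with replications, i.e. each row records successes out of ni = 25 trials; all numbers are illustrative):

```python
# Goodness-of-fit: compare the deviance to chi-square with n - p - 1 df.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 30)
n_trials = np.full(30, 25)                 # ni = 25 replications per level of x
successes = rng.binomial(n_trials, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

X = sm.add_constant(x)
Y = np.column_stack([successes, n_trials - successes])   # (successes, failures)
res = sm.GLM(Y, X, family=sm.families.Binomial()).fit()

p_value = stats.chi2.sf(res.deviance, df=res.df_resid)   # right tail of chi-square
print("p-value:", p_value)   # a large p-value suggests the model may fit well
```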
Goodness of Fit (binary data with no replications) - ANSWER-Use the deviances from the aggregated model for goodness of fit, not based on the individual-level data
Reasons why a model may not be a good fit - ANSWER-There may be other variables that should be included in the model
The relationship between Logit of the expected probability and predictors might be multiplicative, rather than additive
Departure from the linearity assumption
Influential observations, outliers, and leverage points are also still an issue for this model
Logit function does not fit well with the data
The binomial distribution isn't appropriate. For example, if there's correlation among the responses or there's heterogeneity in the success probability that hasn't been modeled. Both of these violations can lead to what we call overdispersion
Overdispersion - ANSWER-This is where the variability of the probability estimates is larger than would be implied by a binomial random variable
ɸ = D/(n-p-1)
D is the Deviance(sum of squared deviances)
If ɸ > 2 then the model is overdispersed; a model that accounts for overdispersion will fit better
Overdispersion impacts the estimated variance and statistical inference. If overdispersion is not accounted for, statistical inference will not be as reliable
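Continuing the goodness-of-fit sketch above (reusing its fitted `res` object), the dispersion estimate is one line:

```python
phi = res.deviance / res.df_resid   # ɸ = D / (n - p - 1)
print("phi:", phi)                  # phi > 2 suggests overdispersion
```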
Link Functions - ANSWER-Complementary log-log (c-log-log) function: This has very long tails, meaning that it works best for extremely skewed distributions
Probit Function: This is the inverse of the CDF of a standard normal distribution. This fits data with least-heavy tails among the three S shaped functions. This would work well when the probabilities are all concentrated within a small range
Logit Function: This is what is called the canonical link function, which means that parameter estimates under logistic regression are fully efficient and tests on those parameters are better behaved for small samples. The interpretations of regression coefficients in terms of log odds is possible with a logit function but not other S-shape functions
Classification - ANSWER-Classification is prediction of binary responses.
If the predicted probability is larger than the threshold R, then classify y star as a success
Classification Error Rate - ANSWER-Classification error rate is the probability that the new response is not equal to its classification under the classifier with threshold R
R is between 0 and 1. The most common value for R is 0.5; however, a different R can be used to improve the prediction accuracy
Training Error - ANSWER-This is the proportion of the responses that are misclassified
We cannot use the training error rate as an estimate of the true classification error rate because it is biased downward
The bias comes from the fact that we use the data twice: once for fitting the model and a second time to estimate the classification error rate
Cross Validation - ANSWER-This is a direct measure of predictive power
Random sampling is computationally more expensive than K-fold cross validation, with no clear advantage in terms of the accuracy of the estimated classification error rate
The rule of thumb for choosing K is about K = 10
LOOCV is a K-fold cross validation with K = n. The larger K is (the more folds), the less biased the estimate of the classification error, but the higher its variability.
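A minimal sketch of a 10-fold cross-validation error estimate (assuming scikit-learn; the threshold R = 0.5 is scikit-learn's default for class prediction):

```python
# Estimate the classification error rate by 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
x = rng.normal(size=(200, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x[:, 0]))))

acc = cross_val_score(LogisticRegression(), x, y, cv=10, scoring="accuracy")
print("estimated classification error rate:", 1 - acc.mean())
```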
LOOCV - ANSWER-LOOCV can be approximated by the sum of the training risk and a complexity penalty.
The complexity penalty is (2 * # of predictors in submodel * estimated_variance of submodel)/n
The variability of the submodel is smaller than that of the full model, thus LOOCV penalizes complexity less than Mallows' Cp
LOOCV is approximately AIC when the true variance is replaced by the estimate of the variance from the submodel
Poisson Regression - ANSWER-The response Y in Poisson regression is assumed to have a Poisson distribution, and this is commonly used for modeling count or rate data.
We assume that the i-th response Yi has a Poisson distribution, with rate lambda i. Alternatively, log of the rate lambda i is equal to the linear combination of the predicting variables
We do not interpret beta with respect to the response variable but with respect to the ratio of the rate
There is no error term
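A minimal sketch of a Poisson fit (assuming statsmodels; the simulated rates are illustrative):

```python
# Poisson regression: log of the rate is linear in the predicting variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.3 + 0.7 * x))   # simulated count data

X = sm.add_constant(x)
res = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # log link by default
print(res.params)   # estimates of beta 0, beta 1 on the log-rate scale
```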
Poisson Regression Assumptions - ANSWER-Linearity: The log transformation of the rate is a linear combination of the predicting variables.
Independence: The response variables are independently observed
Link: The link function g is the log function. The log link function is almost always used
Linearity Assumption - Poisson - ANSWER-Linearity can be evaluated by plotting the log of the event rate versus the predicting variables
We can also evaluate linearity, together with the assumption of uncorrelated responses, using scatterplots of the residuals versus the predicting variables
Generalized Linear Models (GLM) - ANSWER-Here, the response Y is assumed to have a distribution from the exponential family of distributions (Normal, Binomial, Poisson, Gamma etc)
Under this model, we model a transformation g of the expectation of Y, given the predicting variables as a linear combination of the predicting variables
We can write the expectation as the inverse of the g transformation of the linear combination of the predicting variables
**Include table w/ link function & regression function pg 67
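A standard version of that table, using only the distributions and links named in this section:
Distribution | Link function g(µ) | Regression function E(Y|x)
Normal | identity: µ | β0 + β1x1 + ... + βpxp
Binomial | logit: ln(p/(1-p)) | exp(β0 + ... + βpxp) / (1 + exp(β0 + ... + βpxp))
Poisson | log: ln(λ) | exp(β0 + β1x1 + ... + βpxp)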
G transformation - ANSWER-The transformation g is called a link function since it links the expectation of the response to the predicting variables
Poisson Regression vs Log transformed Linear Regression - ANSWER-Standard Regression: We estimate the expectation of the log of the response - E(log(Y))
The variance under the standard regression is assumed constant
Poisson Regression: We estimate the log of the expectation of the response - log(E(Y))
The variance of the response is assumed to be equal to the expectation; thus, the variance is not constant.
**Use the Poisson regression especially when the response data are small counts
**Using the standard linear regression with log transformation instead of Poisson regression, will result in violations of the assumption of constant variance
**Standard linear regression could be used if the counts are large, with the variance-stabilizing transformation √(µ + 3/8), i.e. the square root of the response plus 3/8. This transformation works well when the response data are large counts
Log Rate - ANSWER-This is the log function of the expected value of the response
ln(λ(x)) = β0 + β1x
Regression Coefficient - ANSWER-The regression coefficient is interpreted as the log of the ratio of rates for a one-unit increase in the predicting variable
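This follows from the model above: λ(x+1)/λ(x) = exp(β1), so a one-unit increase in x multiplies the expected rate by e^β1.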