class: center, middle, inverse, title-slide

# Checking Error Assumptions
### Dr. D'Agostino McGowan

---

layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## What are the assumptions for linear regression with respect to `\(\epsilon\)`?

--

* Constant Variance

--

* Normality

--

* Independence

---

## Checking assumptions for the error

* We cannot observe the errors, `\(\epsilon\)`, themselves.

.question[
What can we observe?
]

--

* The residuals! (I often write these as `\(e\)`; your book writes them as `\(\hat\epsilon\)`)

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

--

`$$\begin{align}e &= y - \hat{y}\\ \end{align}$$`

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

`$$\begin{align}e &= y - \hat{y}\\ & =(\mathbf{I-H})y\\ \end{align}$$`

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

`$$\begin{align}e &= y - \hat{y}\\ & =(\mathbf{I-H})y\\ &=(\mathbf{I-H})\mathbf{X}\beta + (\mathbf{I-H})\epsilon\\ \end{align}$$`

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

`$$\begin{align}e &= y - \hat{y}\\ & =(\mathbf{I-H})y\\ &=(\mathbf{I-H})\mathbf{X}\beta + (\mathbf{I-H})\epsilon\\ &=(\mathbf{I-H})\epsilon \end{align}$$`

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

`$$\begin{align}e &= y - \hat{y}\\ & =(\mathbf{I-H})y\\ &=(\mathbf{I-H})\mathbf{X}\beta + (\mathbf{I-H})\epsilon\\ &=(\mathbf{I-H})\epsilon \end{align}$$`

.question[
What is the variance of `\(e\)`?
]

--

`$$\textrm{var}(e) = (\mathbf{I-H})\textrm{var}(\epsilon)(\mathbf{I-H})^T = (\mathbf{I-H})\sigma^2$$`

* (the last step uses the fact that `\(\mathbf{I-H}\)` is symmetric and idempotent)

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors
* Nevertheless, diagnostics can be applied to the residuals to check the assumptions about the errors

---

## Constant variance

* "Residuals versus fits" plot

.question[
What do you think would be on the x-axis and y-axis of a "residuals versus fits" plot?
]

---

## Constant variance

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Constant variance

### Residuals versus fits plot: What are we looking for?

* random variation above and below 0
* no apparent "patterns"
* the width of the "band" of points is relatively constant

---

# Constant Variance

.question[
What do you think of this plot?
]

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# Constant Variance

.question[
What do you think of this plot?
]

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

# Constant Variance

.question[
What do you think of this plot?
]

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## Constant variance

* In addition to looking at the residuals versus the fitted values, you can also plot the residuals against the _predictors_ or other `\(x\)` variables (with the `\(x\)` variable on the x-axis); a quick sketch is on the next slide.
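---

## Residuals versus a predictor in R

A minimal sketch of this idea, where the model and variables (`mpg ~ disp + wt` on `mtcars`) are just illustrative choices:

```r
library(ggplot2)

# an illustrative model with two predictors
mod <- lm(mpg ~ disp + wt, data = mtcars)

d <- data.frame(
  e = resid(mod),
  disp = mtcars$disp
)

# same idea as a residuals-versus-fits plot, but with a
# single predictor on the x-axis
ggplot(d, aes(x = disp, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0)
```

As with the residuals-versus-fits plot, we are looking for a band of points of roughly constant width centered at 0.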
---

## Checking constant variance in R

```r
mod <- lm(mpg ~ disp, mtcars)

d <- data.frame(
  e = resid(mod),
  fit = fitted(mod)
)
```

--

```r
library(ggplot2)
ggplot(d, aes(x = fit, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0)
```

---

## Checking constant variance in R

```r
ggplot(d, aes(x = fit, y = e)) +
  geom_point() +
* geom_hline(yintercept = 0)
```

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

## Generating data in R

.small[

```r
n <- 50
fit <- runif(n)

good <- data.frame(
  fit = fit,
  e = rnorm(n),
  type = "good"
)

really_bad_var <- data.frame(
  fit = fit,
  e = fit * rnorm(n),
  type = "really bad non-constant variance"
)

mildly_bad_var <- data.frame(
  fit = fit,
  e = sqrt(fit) * rnorm(n),
  type = "mildly bad non-constant variance"
)

bad_nonlinear <- data.frame(
  fit = fit,
  e = cos(fit * pi / 25) * rnorm(n),
  type = "bad, non-linear"
)

d <- rbind(good, really_bad_var, mildly_bad_var, bad_nonlinear)
```

]

---

## Generating data in R

```r
ggplot(d, aes(x = fit, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_grid(~type)
```

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

## Let's look at a bunch of "good" plots

```r
good <- function(i, n = 50) {
  fit <- runif(n)
  data.frame(
    fit = fit,
    e = rnorm(n),
    run = i
  )
}

d <- purrr::map_df(1:9, good)
```

---

## Let's look at a bunch of "good" plots

```r
ggplot(d, aes(x = fit, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_wrap(~run)
```

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

## Let's look at a bunch of "good" plots

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

class: inverse

## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Application Exercise`

Using the code provided in the previous slides, generate multiple "bad" plots that have non-constant variance and discuss what they look like.
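---

## Getting started on the exercise

One possible starting template, adapting the `good()` helper so that the spread of `e` depends on the fitted value (as in the `really_bad_var` example); the `bad()` name and settings here are just one option:

```r
library(ggplot2)

# residual spread grows with the fitted value -> non-constant variance
bad <- function(i, n = 50) {
  fit <- runif(n)
  data.frame(
    fit = fit,
    e = fit * rnorm(n),
    run = i
  )
}

d_bad <- purrr::map_df(1:9, bad)

ggplot(d_bad, aes(x = fit, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_wrap(~run)
```

Try other variance functions (for example `sqrt(fit)` or `fit^2`) and compare how obvious the "fanning" pattern is across the nine panels.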