class: center, middle, inverse, title-slide

# Checking Error Assumptions
### Dr. D'Agostino McGowan

---

layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## What are the assumptions for linear regression with respect to `\(\epsilon\)`?

--

* Constant Variance

--

* Normality

--

* Independence

---

## Checking assumptions for the error

* We cannot observe the errors, `\(\epsilon\)`, themselves.

.question[
What can we observe?
]

--

* The residuals! (I often write these as `\(e\)`; your book writes them as `\(\hat\epsilon\)`)

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

--

`$$\begin{align}e &= y - \hat{y}\\ \end{align}$$`

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

`$$\begin{align}e &= y - \hat{y}\\ & =(\mathbf{I-H})y\\ \end{align}$$`

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

`$$\begin{align}e &= y - \hat{y}\\ & =(\mathbf{I-H})y\\ &=(\mathbf{I-H})\mathbf{X}\beta + (\mathbf{I-H})\epsilon\\ \end{align}$$`

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

`$$\begin{align}e &= y - \hat{y}\\ & =(\mathbf{I-H})y\\ &=(\mathbf{I-H})\mathbf{X}\beta + (\mathbf{I-H})\epsilon\\ &=(\mathbf{I-H})\epsilon \end{align}$$`

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors

.question[
Why?
]

`$$\begin{align}e &= y - \hat{y}\\ & =(\mathbf{I-H})y\\ &=(\mathbf{I-H})\mathbf{X}\beta + (\mathbf{I-H})\epsilon\\ &=(\mathbf{I-H})\epsilon \end{align}$$`

.question[
What is the variance of `\(e\)`?
]

--

`$$\textrm{var}(e) = (\mathbf{I-H})\textrm{var}(\epsilon)(\mathbf{I-H})^T = (\mathbf{I-H})\sigma^2$$`

* (the last step uses the fact that `\(\mathbf{I-H}\)` is symmetric and idempotent)

---

## Residuals versus `\(\epsilon\)`

* The residuals are not **exactly** the same as the errors
* Nevertheless, diagnostics can be applied to the residuals to check the assumptions about the errors

---

## Constant variance

* "Residuals versus fits" plot

.question[
What do you think would be on the x-axis and y-axis of a "residuals versus fits" plot?
]

---

## Constant variance

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Constant variance

### Residuals versus fits plot: What are we looking for?

* random variation above and below 0
* no apparent "patterns"
* the width of the "band" of points is relatively constant

---

# Constant Variance

.question[
What do you think of this plot?
]

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# Constant Variance

.question[
What do you think of this plot?
]

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

# Constant Variance

.question[
What do you think of this plot?
]

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## Constant variance

* In addition to looking at the residuals versus the fitted values, you can also plot the residuals against the _predictors_ or other `\(x\)` variables (with the `\(x\)` variable on the x-axis); a quick sketch is on the next slide.
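---

## Residuals versus a predictor in R

A minimal sketch of this idea, where the model and variables (`mpg ~ disp + wt` on `mtcars`) are just illustrative choices:

```r
library(ggplot2)

# an illustrative model with two predictors
mod <- lm(mpg ~ disp + wt, data = mtcars)

d <- data.frame(
  e = resid(mod),
  disp = mtcars$disp
)

# same idea as a residuals-versus-fits plot, but with a
# single predictor on the x-axis
ggplot(d, aes(x = disp, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0)
```

As with the residuals-versus-fits plot, we are looking for a band of points of roughly constant width centered at 0.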
---

## Checking constant variance in R

```r
mod <- lm(mpg ~ disp, mtcars)

d <- data.frame(
  e = resid(mod),
  fit = fitted(mod)
)
```

--

```r
library(ggplot2)
ggplot(d, aes(x = fit, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0)
```

---

## Checking constant variance in R

```r
ggplot(d, aes(x = fit, y = e)) +
  geom_point() +
* geom_hline(yintercept = 0)
```

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

## Generating data in R

.small[

```r
n <- 50
fit <- runif(n)

good <- data.frame(
  fit = fit,
  e = rnorm(n),
  type = "good"
)

really_bad_var <- data.frame(
  fit = fit,
  e = fit * rnorm(n),
  type = "really bad non-constant variance"
)

mildly_bad_var <- data.frame(
  fit = fit,
  e = sqrt(fit) * rnorm(n),
  type = "mildly bad non-constant variance"
)

bad_nonlinear <- data.frame(
  fit = fit,
  e = cos(fit * pi / 25) * rnorm(n),
  type = "bad, non-linear"
)

d <- rbind(good, really_bad_var, mildly_bad_var, bad_nonlinear)
```

]

---

## Generating data in R

```r
ggplot(d, aes(x = fit, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_grid(~type)
```

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

## Let's look at a bunch of "good" plots

```r
good <- function(i, n = 50) {
  fit <- runif(n)
  data.frame(
    fit = fit,
    e = rnorm(n),
    run = i
  )
}

d <- purrr::map_df(1:9, good)
```

---

## Let's look at a bunch of "good" plots

```r
ggplot(d, aes(x = fit, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_wrap(~run)
```

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

## Let's look at a bunch of "good" plots

![](16-checking-error-assumptions_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

class: inverse

## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Application Exercise`

Using the code provided in the previous slides, generate multiple "bad" plots that have non-constant variance and discuss what they look like.
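---

## Getting started on the exercise

One possible starting template, adapting the `good()` helper so that the spread of `e` depends on the fitted value (as in the `really_bad_var` example); the `bad()` name and settings here are just one option:

```r
library(ggplot2)

# residual spread grows with the fitted value -> non-constant variance
bad <- function(i, n = 50) {
  fit <- runif(n)
  data.frame(
    fit = fit,
    e = fit * rnorm(n),
    run = i
  )
}

d_bad <- purrr::map_df(1:9, bad)

ggplot(d_bad, aes(x = fit, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_wrap(~run)
```

Try other variance functions (for example `sqrt(fit)` or `fit^2`) and compare how obvious the "fanning" pattern is across the nine panels.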