Unusual Observations

Dr. D’Agostino McGowan

1 / 27

Leverage

  • Leverage is the amount of influence an observation has on the estimation of $\hat{\beta}$
  • Mathematically, we can define this as the diagonal elements of the hat matrix.

    What is the hat matrix?

  • $H = X(X^TX)^{-1}X^T$
2 / 27

Leverage

  • Leverage is the amount of influence an observation has on the estimation of $\hat{\beta}$
  • Mathematically, we can define this as the diagonal elements of the hat matrix.

$$h_i = H_{ii}, \qquad h_i = \left[X(X^TX)^{-1}X^T\right]_{ii}$$
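
As a rough illustration of this definition, the sketch below builds the hat matrix by hand and reads the leverages off its diagonal. It assumes the `mpg ~ disp` model on `mtcars` that appears on the plotting slide of this deck; the object names `X`, `H`, and `h` are chosen here for illustration only.

```r
# Fit a simple regression (assumed example model)
mod <- lm(mpg ~ disp, data = mtcars)

# Design matrix X: an intercept column plus disp
X <- model.matrix(mod)

# Hat matrix H = X (X^T X)^{-1} X^T
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Leverages are the diagonal elements of H
h <- diag(H)

# Compare with R's built-in helper; this should report TRUE
all.equal(unname(h), unname(hatvalues(mod)))
```

Forming the full n × n matrix is fine for a small data set like this one; for larger problems `hatvalues()` is the more practical route.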

3 / 27

Leverage

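What do we use the diagonal of the hat matrix for?

  • Recall that the variance of the residuals is

$$\mathrm{var}(e_i) = \sigma^2(1 - h_i)$$

  • so a point with large leverage ($h_i$ close to 1) has a residual with small variance, meaning the fit is pulled towards $y_i$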
4 / 27

Leverage

  • The leverage, $h_i$, will always be between 0 and 1

How do we know this? Let's show it using the fact that the hat matrix is idempotent and symmetric.

$$h_i = \sum_j H_{ij}H_{ji} = \sum_j H_{ij}^2 = H_{ii}^2 + \sum_{j \neq i} H_{ij}^2 = h_i^2 + \sum_{j \neq i} H_{ij}^2$$

8 / 27

Leverage

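  • The leverage, $h_i$, will always be between 0 and 1

$$h_i = \sum_j H_{ij}H_{ji} = \sum_j H_{ij}^2 = H_{ii}^2 + \sum_{j \neq i} H_{ij}^2 = h_i^2 + \sum_{j \neq i} H_{ij}^2$$

  • The remaining sum of squares is non-negative, so $h_i$ must be at least as large as $h_i^2$, implying that $h_i$ will always be between 0 and 1 ✅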
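
The argument can also be checked numerically. The sketch below is only an illustration, again assuming the `mpg ~ disp` model on `mtcars`; the logical flags (`is_symmetric`, `is_idempotent`, and so on) are names invented for this example.

```r
mod <- lm(mpg ~ disp, data = mtcars)
X <- model.matrix(mod)
H <- X %*% solve(crossprod(X)) %*% t(X)

# Symmetric: H equals its transpose
is_symmetric <- isTRUE(all.equal(H, t(H)))

# Idempotent: H %*% H equals H
is_idempotent <- isTRUE(all.equal(H %*% H, H))

# Each diagonal element equals the sum of squares of its row
row_ss <- rowSums(H^2)
matches_identity <- isTRUE(all.equal(unname(diag(H)), unname(row_ss)))

# Every leverage falls between 0 and 1
h <- diag(H)
in_range <- all(h >= 0 & h <= 1)

c(is_symmetric, is_idempotent, matches_identity, in_range)
```

The comparisons use `all.equal()` rather than `==` because the matrix products are only equal up to floating-point error.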
9 / 27

Leverage

  • $\sum_i h_i = p + 1$ (remember when we calculated the trace of $H$?)
  • This means the average value of $h_i$ is $(p+1)/n$
  • 👍 A rule of thumb: leverages greater than $2(p+1)/n$ should get an extra look
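
One way to apply this rule of thumb in R is sketched below. It assumes the `mpg ~ disp` model on `mtcars` as a stand-in example; `cutoff` is a name introduced here, and `p` counts predictors excluding the intercept, matching the slides.

```r
mod <- lm(mpg ~ disp, data = mtcars)
h <- hatvalues(mod)

n <- nrow(mtcars)           # number of observations
p <- length(coef(mod)) - 1  # number of predictors, excluding the intercept

# The leverages sum to p + 1 (the trace of H)
sum(h)

# Rule-of-thumb cutoff: twice the average leverage
cutoff <- 2 * (p + 1) / n

# Observations whose leverage exceeds the cutoff
h[h > cutoff]
```

A point flagged here is not necessarily a problem; the cutoff only marks observations worth a closer look.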
10 / 27

Standardized residuals

  • We can use the leverages to standardize the residuals
  • Instead of plotting the residuals, $e_i$, we can plot the standardized residuals

$$r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}}$$

  • 👍 A rule of thumb for standardized residuals: those greater than 4 in absolute value would be very unusual and should get an extra look
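
The formula can be applied directly, as in the sketch below. It assumes the `mpg ~ disp` model on `mtcars` and takes $\hat{\sigma}$ to be the residual standard error reported by `summary()`; the names `e`, `h`, `sigma_hat`, and `r` are illustrative.

```r
mod <- lm(mpg ~ disp, data = mtcars)

e <- residuals(mod)              # raw residuals e_i
h <- hatvalues(mod)              # leverages h_i
sigma_hat <- summary(mod)$sigma  # residual standard error

# Standardized residuals computed from the formula
r <- e / (sigma_hat * sqrt(1 - h))

# Should agree with the built-in function
all.equal(r, rstandard(mod))

# Any observations past the rule-of-thumb threshold
which(abs(r) > 4)
```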
11 / 27

Application Exercise

y x
1 0
5 4
2 2
2 1
11 10

Using the data above, calculate:

  • The leverage for each observation. Are any "unusual"?
  • The standardized residuals, $r_i$.
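
For checking hand calculations, a sketch like the following may help. It assumes a simple linear regression of `y` on `x` (so $p = 1$), which the exercise does not state explicitly; the data frame name `d` and the object `fit` are made up for this example.

```r
# The five observations from the table above
d <- data.frame(
  y = c(1, 5, 2, 2, 11),
  x = c(0, 4, 2, 1, 10)
)

# Assumed model: simple linear regression, so p = 1
fit <- lm(y ~ x, data = d)

# Leverages and the 2(p + 1)/n rule-of-thumb cutoff
h <- hatvalues(fit)
cutoff <- 2 * (1 + 1) / nrow(d)
h > cutoff

# Standardized residuals
r <- rstandard(fit)
r
```

Under this assumed model, only the fifth observation ($x = 10$) has leverage above the cutoff of 0.8.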
12 / 27

Doing it in R

  • It is good to understand how to calculate these standardized residuals by hand, but there is an R function that does this for you (rstandard())
  • There is also an R function to calculate the leverage (hatvalues())
13 / 27

Standardized residuals

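library(ggplot2)

# Fit the model, then collect standardized residuals and fitted values
mod <- lm(mpg ~ disp, data = mtcars)
d <- data.frame(
  standardized_resid = rstandard(mod),
  fit = fitted(mod)
)

# Plot standardized residuals against fitted values
ggplot(d, aes(fit, standardized_resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  labs(y = "Standardized Residual")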

14 / 27

Outliers

  • An outlier is a point that doesn't fit the current model well

  • In this first plot, we see a point that is definitely an outlier, but it doesn't have much leverage or influence over the fit
16 / 27

Outliers

  • An outlier is a point that doesn't fit the current model well

  • In this second plot, we see a point that has large leverage but is not an outlier and doesn't have much influence over the fit
17 / 27

Outliers

  • An outlier is a point that doesn't fit the current model well

  • In the third plot, the point is both an outlier and very influential. Not only is the residual for this point large, but it also inflates the residuals for the other points
18 / 27

Outliers

  • To detect points like this third example, it can be prudent to exclude the point and recompute the estimates to get $\hat{\beta}_{(i)}$ and $\hat{\sigma}^2_{(i)}$

$$\hat{y}_{(i)} = x_i^T\hat{\beta}_{(i)}$$

  • If $\hat{y}_{(i)} - y_i$ is large, then observation $i$ is an outlier.

How do we determine "large"? We need to scale it using the variance!
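
The leave-one-out idea can be sketched with a plain loop. This is only an illustration, assuming the `mpg ~ disp` model on `mtcars`; `loo_diff`, `fit_i`, and `y_hat_i` are names invented here.

```r
mod <- lm(mpg ~ disp, data = mtcars)
n <- nrow(mtcars)

# For each i: refit without observation i, then predict observation i
loo_diff <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(mpg ~ disp, data = mtcars[-i, ])
  y_hat_i <- predict(fit_i, newdata = mtcars[i, , drop = FALSE])
  loo_diff[i] <- y_hat_i - mtcars$mpg[i]
}
names(loo_diff) <- rownames(mtcars)

# Largest unscaled differences; these still need scaling to be judged
head(sort(abs(loo_diff), decreasing = TRUE))
```

Refitting n times is affordable here, but these differences are on the scale of the response, which is why a variance-based scaling is needed before calling any of them "large".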

19 / 27

Application Exercise

Show that

$$\widehat{\mathrm{var}}\left(y_i - \hat{y}_{(i)}\right) = \hat{\sigma}^2_{(i)}\left(1 + x_i^T\left(X_{(i)}^TX_{(i)}\right)^{-1}x_i\right)$$

y x
1 0
5 4
2 2
2 1
11 10

Using the data above, calculate $\widehat{\mathrm{var}}\left(y_i - \hat{y}_{(i)}\right)$ for observation 5.
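
A hand calculation for observation 5 can be checked with a sketch along these lines. It assumes a simple linear regression of `y` on `x` and reads the target quantity as the estimated variance of $y_i - \hat{y}_{(i)}$; the names `d`, `fit_i`, `X_i`, `x_i`, and `var_hat` are illustrative.

```r
d <- data.frame(
  y = c(1, 5, 2, 2, 11),
  x = c(0, 4, 2, 1, 10)
)
i <- 5

# Refit without observation i (assumed model: y ~ x)
fit_i <- lm(y ~ x, data = d[-i, ])

# Leave-one-out variance estimate and design matrix
sigma2_i <- summary(fit_i)$sigma^2
X_i <- model.matrix(fit_i)

# Row of the full design matrix for observation i: intercept and x
x_i <- c(1, d$x[i])

# sigma^2_(i) * (1 + x_i^T (X_(i)^T X_(i))^{-1} x_i)
var_hat <- sigma2_i * (1 + drop(t(x_i) %*% solve(crossprod(X_i)) %*% x_i))
var_hat
```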

20 / 27

Studentized residuals

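$$t_i = \frac{y_i - \hat{y}_{(i)}}{\hat{\sigma}_{(i)}\left(1 + x_i^T\left(X_{(i)}^TX_{(i)}\right)^{-1}x_i\right)^{1/2}}$$

These have a $t$ distribution with $(n-1)-(p+1) = n-p-2$ degrees of freedom if the model is correct and $\epsilon \sim N(0, \sigma^2 I)$.

  • There is an easier way to compute these using the standardized residuals!

$$t_i = r_i\left(\frac{n-p-2}{n-p-1-r_i^2}\right)^{1/2}$$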
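
The shortcut can be tried out without refitting the model n times, as in this sketch. It assumes the `mpg ~ disp` model on `mtcars`, with `p` counting predictors excluding the intercept as in the slides; `t_shortcut` is a name introduced for the example.

```r
mod <- lm(mpg ~ disp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(mod)) - 1  # predictors, excluding the intercept

# Standardized residuals r_i
r <- rstandard(mod)

# Shortcut: t_i = r_i * sqrt((n - p - 2) / (n - p - 1 - r_i^2))
t_shortcut <- r * sqrt((n - p - 2) / (n - p - 1 - r^2))

# Should agree with R's studentized residuals
all.equal(t_shortcut, rstudent(mod))
```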

21 / 27

Application Exercise

y x
1 0
5 4
2 2
2 1
11 10

Calculate the studentized residuals for the data above.

22 / 27

Studentized residuals

It is good to understand how to calculate these studentized residuals by hand, but there is an R function that does this for you (rstudent())

23 / 27

Influential points

  • A common measure to determine influential points is Cook's D

$$D_i = \frac{(\hat{y} - \hat{y}_{(i)})^T(\hat{y} - \hat{y}_{(i)})}{(p+1)\hat{\sigma}^2}$$

  • $(\hat{y} - \hat{y}_{(i)})$ is the change in the fit after leaving observation $i$ out.
  • This can be calculated using:

$$D_i = \frac{1}{p+1}\,r_i^2\,\frac{h_i}{1-h_i}$$

  • 👍 A rule of thumb is to give observations with Cook's Distance > $4/n$ an extra look
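
Both forms of Cook's D can be compared in a short sketch. It assumes the `mpg ~ disp` model on `mtcars`; the names `D`, `D_def`, `fit_i`, and `change` are illustrative, and the leave-one-out check is done for a single observation (`i <- 1`) just to keep the example short.

```r
mod <- lm(mpg ~ disp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(mod)) - 1  # predictors, excluding the intercept

r <- rstandard(mod)
h <- hatvalues(mod)

# Shortcut form: D_i = r_i^2 / (p + 1) * h_i / (1 - h_i)
D <- (1 / (p + 1)) * r^2 * h / (1 - h)
all.equal(D, cooks.distance(mod))

# Definition, checked for one observation by refitting without it
i <- 1
fit_i <- lm(mpg ~ disp, data = mtcars[-i, ])
change <- fitted(mod) - predict(fit_i, newdata = mtcars)
D_def <- sum(change^2) / ((p + 1) * summary(mod)$sigma^2)
all.equal(D_def, unname(D[i]))

# Observations past the 4/n rule of thumb
D[D > 4 / n]
```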
24 / 27

Cook's Distance

25 / 27

Application Exercise

y x
1 0
5 4
2 2
2 1
11 10

Calculate Cook's Distance for the data above and plot it with the row number on the x-axis and Cook's Distance on the y-axis.

26 / 27

Cook's Distance

It is good to understand how to calculate these by hand, but there is an R function that does this for you (cooks.distance())

27 / 27
