class: center, middle, inverse, title-slide

# F-tests in R
### Dr. D’Agostino McGowan

---
layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## Hitters data

```r
library(ISLR)
# keep only the rows with no missing values
hitters_cc <- Hitters[complete.cases(Hitters), ]
```

--

* We want to predict `Salary` from `AtBat` and `Hits`

--

* Is this model better than an intercept-only model?

---

## Hypothesis testing

* `\(H_0:\beta_1=\beta_2=0\)`

--

`$$F = \frac{(RSS_{small} - RSS_{larger}) / (df_{small}- df_{larger})}{RSS_{larger}/df_{larger}}\sim F_{df_{small}- df_{larger}, df_{larger}}$$`

---

## Let's do it in R!

```r
small <- lm(Salary ~ 1, data = hitters_cc)
larger <- lm(Salary ~ AtBat + Hits, data = hitters_cc)
```

.question[
How do we get the RSS from these models?
]

--

.small[
```r
# RSS = (residual standard error)^2 * residual degrees of freedom
rss_small <- summary(small)$sigma^2 * (nrow(hitters_cc) - 1)
rss_larger <- summary(larger)$sigma^2 * (nrow(hitters_cc) - 3)
```
]

--

* There is an easier way!

```r
rss_small <- deviance(small)
rss_larger <- deviance(larger)
```

---

## Hypothesis testing

* `\(H_0:\beta_1=\beta_2=0\)`

`$$F = \frac{(RSS_{small} - RSS_{larger}) / (df_{small}- df_{larger})}{RSS_{larger}/df_{larger}}\sim F_{df_{small}- df_{larger}, df_{larger}}$$`

--

```r
f <- ((rss_small - rss_larger) / 2) /
  (rss_larger / (nrow(hitters_cc) - 3))
f
```

```
## [1] 33.23299
```

--

```r
1 - pf(f, 2, nrow(hitters_cc) - 3)
```

```
## [1] 1.405542e-13
```

---

## F-tests in R

.pull-left[
.small[
```r
f <- ((rss_small - rss_larger) / 2) /
  (rss_larger / (nrow(hitters_cc) - 3))
f
```

```
## [1] 33.23299
```
]
]

.pull-right[
.small[
```r
1 - pf(f, 2, nrow(hitters_cc) - 3)
```

```
## [1] 1.405542e-13
```
]]

--

* There is an easier way!

```r
anova(small, larger)
```

```
## Analysis of Variance Table
##
## Model 1: Salary ~ 1
## Model 2: Salary ~ AtBat + Hits
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)
## 1    262 53319113
## 2    260 42463750  2  10855363 33.233 1.405e-13
```

---

## Testing one variable

.question[
What if we wanted to know whether the `Hits` variable makes an important contribution?
]

--

* `\(H_0: \beta_2 = 0\)`
* `\(H_A: \beta_2 \neq 0\)`

--

.small[
```r
small <- lm(Salary ~ AtBat, data = hitters_cc)
larger <- lm(Salary ~ AtBat + Hits, data = hitters_cc)
anova(small, larger)
```

```
## Analysis of Variance Table
##
## Model 1: Salary ~ AtBat
## Model 2: Salary ~ AtBat + Hits
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)
## 1    261 45009644
## 2    260 42463750  1   2545894 15.588 0.0001014
```
]

---

## Testing one variable

.question[
What if we wanted to know whether the `Hits` variable makes an important contribution?
]

* `\(H_0: \beta_2 = 0\)`
* `\(H_A: \beta_2 \neq 0\)`

.small[
```r
small <- lm(Salary ~ AtBat, data = hitters_cc)
larger <- lm(Salary ~ AtBat + Hits, data = hitters_cc)
anova(small, larger)
```

```
## Analysis of Variance Table
##
## Model 1: Salary ~ AtBat
## Model 2: Salary ~ AtBat + Hits
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)
## 1    261 45009644
## 2    260 42463750  1   2545894 15.588 0.0001014
```
]

* There is an easier way!

---

## Testing one variable

.small[
```
## Analysis of Variance Table
##
## Model 1: Salary ~ AtBat
## Model 2: Salary ~ AtBat + Hits
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)
## 1    261 45009644
*## 2    260 42463750  1   2545894 15.588 0.0001014
```
]

.small[
```r
summary(larger)
```

```
##
## Call:
## lm(formula = Salary ~ AtBat + Hits, data = hitters_cc)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1006.05  -247.38   -79.15   179.41  2002.17
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141.2720    76.5526   1.845 0.066113
## AtBat        -1.2160     0.6372  -1.908 0.057430
*## Hits          8.2119     2.0799   3.948 0.000101
##
## Residual standard error: 404.1 on 260 degrees of freedom
## Multiple R-squared:  0.2036, Adjusted R-squared:  0.1975
## F-statistic: 33.23 on 2 and 260 DF,  p-value: 1.405e-13
```
]

---
class: inverse

## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Application Exercise`

Using the Hitters data:

* Fit a model predicting `Salary` from `AtBat` and `HmRun`
* Perform an F-test for the null hypothesis that the coefficient associated with `HmRun` is 0
* Compare this to the t-test obtained from the `summary()` output.
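
---

## Checking the comparison numerically

When a single coefficient is tested, the F statistic from the nested-model comparison should equal the square of that coefficient's t statistic. For `Hits`, the tables in the slides give `\(3.948^2 \approx 15.59\)`, which agrees with `\(F = 15.588\)`. The sketch below is one way to check this in R, reusing the `AtBat` and `Hits` models from the earlier slides; the names `f_stat` and `t_stat` are illustrative only and do not appear elsewhere in the deck.

```r
library(ISLR)  # provides the Hitters data used throughout

hitters_cc <- Hitters[complete.cases(Hitters), ]
small <- lm(Salary ~ AtBat, data = hitters_cc)
larger <- lm(Salary ~ AtBat + Hits, data = hitters_cc)

# F statistic from the nested-model comparison (second row of the table)
f_stat <- anova(small, larger)$F[2]

# t statistic for Hits from the coefficient table of the larger model
t_stat <- coef(summary(larger))["Hits", "t value"]

# With one restriction, F should match t^2 up to floating-point error
all.equal(f_stat, t_stat^2)
```

The same pattern should carry over to the `HmRun` model in the exercise once the formulas are swapped.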