Due: Tuesday 2020-09-22 at 5pm
02-lab-least-squares
)In this lab we will work with one packages: ISLR
.
Install this packages by running the following in the console.
Now that the necessary packages are installed, you should be able to Knit your document and see the results.
If you’d like to run your code in the Console as well you’ll also need to load the packages there. To do so, run the following in the console.
Note that the packages are also loaded with the same commands in your R Markdown document.
The top portion of your R Markdown file (between the three dashed lines) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
The data frame we will be working with today is called Hitters
and it’s in the ISLR
package.
To find out more about the dataset, type the following in your Console: ?Hitters
. A question mark before the name of an object will always bring up its help file. This command must be ran in the Console.
Remember: The Console is at the bottom of your RStudio workspace. Things you type in the Console will not be in your final report. This is a good place to peek at data (try typing Hitters
in the Console) and look at help files with the ?
.
Hitters
file have? What are the variables included in the data frame? Add your responses to your lab report.Major League Baseball Data from the 1986 and 1987 seasons.
Hint: A shortcut to determine how many rows are in a data frame is the nrow
function. Run ?nrow
in the Console to learn about how this works.
hitters_cc
with complete case data from the Hitters
data frame. This means we are removing all missing values. Copy the code below to complete this task. How many rows were removed?We are interested in predicting baseball players’ salary using the number of times they were at bat and the number of hits they had. Create a design matrix, X
, that contains two predictor variables from the hitters_cc
data frame. Create a variable y
that contains the outcome data for the players’ salary.
Using the objects created in Exercise 2 (X
and y
), calculate \(\hat\beta\), \(\hat{y}\), the residuals: \(e = (y - \hat{y})\). Interpret what these values mean.
Calculate the variance of \(\hat\beta\) and save as v_beta
. Hint: You can estimate \(\sigma^2\) with the following formula:
\[\hat\sigma^2 = \frac{1}{n-3}\sum_{i=1}^n e_i^2\] Use this to calculate the standard error of your coefficients by running the following formula:
Interpret these results.
Using the cor
function, compute the correlation between \(y\) and \(\hat{y}\) and square this output. We will use this in a future exercise.
Fit the same model as above (i.e. predict baseball players’ salary using the number of times they were at bat and the number of hits they had) using the lm
function. Save this output as an object called lm_fit
. How does this output compare to the results in Exercises 4 and 5?
The residuals
function can be used to extract the residuals from a model object. Run residuals(lm_fit)
. How does this compare to your result in Exercise 4? Hint: for a direct comparison, you can make the residuals a matrix like this: matrix(residuals(lm_fit))
; you may also be interested in the all.equal
function, which tests equality but allows for small differences in numeric results.
The fitted
function can be used to extract \(\hat{y}\) values from a model object. Run fitted(lm_fit)
. How does this compare to your result in Exercise 4?
The summary
function can be used to to calculate a summary of the model fit. Use the summary
function on the lm_fit
object created in Exercise 7 and save this as lm_summary
. This will save lots of pieces of information about your model. Returning to Exercise 6, where do you see the value calculated here in the lm_summary
output? What does this value mean?
The lm_summary
object contains a list
with lots of important summary statistics. To see all of the pieces here, run names(lm_summary)
. To see what each of these components corresponds to, run ?summary.lm
in the Console and scroll down to the Value
section of the help file. Describe the 11 summary statistics contained here.
Examine the estimate for \(\sigma\) from the your linear model using lm_summary$sigma
. How does this compare to \(\sigma^2\) that you calculated in Exercise 4?
Examine the estimate for \((\mathbf{X}^T\mathbf{X})^{-1}\) from your linear model using lm_summary$cov.unscaled
. How does this compare to your estimate of \((\mathbf{X}^T\mathbf{X})^{-1}\) from Exercise 4?
A collaborator tells you they would like to fit the model:
\[Salary = \beta_0 + \beta_1 AtBat + \beta_2 Hits + \beta_3 2\times Hits + \epsilon\] Where the predictors are AtBat
, Hits
, and 2Hits
which is equal to twice the number of hits. Does this make sense? Why or why not?