class: center, middle, inverse, title-slide

# Indicator Variables
### Dr. D’Agostino McGowan

---
layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## Variable types

* So far, our models have only included _numeric_ (_quantitative_) variables

--

* What would the equation be for predicting `\(y\)` from `\(x\)` when `\(x\)` is numeric?

--

* What would happen if `\(x\)` is categorical?

--

* What would the equation be for predicting `\(y\)` from `\(x\)` if `\(x\)` is categorical with 2 levels?

--

* What would the equation be for predicting `\(y\)` from `\(x\)` if `\(x\)` is categorical with 3 levels?

---
class: middle

## Indicator variable

An **indicator variable** uses two values, usually 0 and 1, to indicate whether a data case does (1) or does not (0) belong to a specific category.

---

```r
library(Stat2Data)
data("Diamonds")
```

.small[
]

---

## Indicator variables

.question[
What does this line of code do?
]

```r
*Diamonds$ColorD <- ifelse(Diamonds$Color == "D", 1, 0)
Diamonds$ColorE <- ifelse(Diamonds$Color == "E", 1, 0)
Diamonds$ColorF <- ifelse(Diamonds$Color == "F", 1, 0)
Diamonds$ColorG <- ifelse(Diamonds$Color == "G", 1, 0)
Diamonds$ColorH <- ifelse(Diamonds$Color == "H", 1, 0)
Diamonds$ColorI <- ifelse(Diamonds$Color == "I", 1, 0)
Diamonds$ColorJ <- ifelse(Diamonds$Color == "J", 1, 0)
```

---

## Indicator variables

.question[
What does this line of code do?
]

```r
Diamonds$ColorD <- ifelse(Diamonds$Color == "D", 1, 0)
*Diamonds$ColorE <- ifelse(Diamonds$Color == "E", 1, 0)
Diamonds$ColorF <- ifelse(Diamonds$Color == "F", 1, 0)
Diamonds$ColorG <- ifelse(Diamonds$Color == "G", 1, 0)
Diamonds$ColorH <- ifelse(Diamonds$Color == "H", 1, 0)
Diamonds$ColorI <- ifelse(Diamonds$Color == "I", 1, 0)
Diamonds$ColorJ <- ifelse(Diamonds$Color == "J", 1, 0)
```

---

## Indicator variables

.small[
]

---

## Indicator variables

.question[
What if I wanted to model the relationship between `TotalPrice` and `Color`?
]

.small[
```r
library(dplyr)  # provides %>% and select()

Diamonds %>%
  select(TotalPrice, Carat, Color, ColorD, ColorE, ColorF, ColorG, ColorH, ColorI) %>%
  DT::datatable(options = list(pageLength = 5))
```
]

---

## Indicator variables

.question[
Why is `ColorJ` `NA`?
]

.small[
```r
lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI + ColorJ,
   data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
##     ColorH + ColorI + ColorJ, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
##        1936         3632         2423         7224         7623         6732  
*##      ColorI       ColorJ  
*##        5704           NA  
```
]

--

* When a categorical variable has `k` categories, include only `k-1` indicator variables in the model. With an intercept, all `k` indicators sum to 1 in every row, so the last one is redundant and R reports `NA` for it

--

* The one that is left out is the "reference" category

---

## Indicator variables

.question[
What is the reference category?
]

.small[
```r
lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI,
   data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
##     ColorH + ColorI, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
##        1936         3632         2423         7224         7623         6732  
##      ColorI  
##        5704  
```
]

--

* **Interpretation:** A diamond with color `D` has an expected total price 3632 higher than a diamond with color `J`.

--

* **Interpretation:** A diamond with color `E` has an expected total price 2423 higher than a diamond with color `J`.

---

## Indicator variables

.question[
What is the reference category?
]

.small[
```r
lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI,
   data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
##     ColorH + ColorI, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
##        1936         3632         2423         7224         7623         6732  
##      ColorI  
##        5704  
```
]

* **Interpretation:** A diamond with color `D` has an expected total price 3632 higher than a diamond with color `J`.
* What is the interpretation for a diamond with color `F`?

---

## In design matrix form

.question[
What would the design matrix look like?
]

`$$\begin{bmatrix}1&0&1&0&0&0\\
1&0&0&1&0&0\\
1&0&0&0&0&1\\
1&0&0&1&0&0\\
\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\end{bmatrix}$$`

--

.question[
How many columns would this have?
]

--

* One column per category, minus 1 for the referent, plus 1 for the intercept

---

## R is smart

.small[
```r
lm(TotalPrice ~ Color, data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ Color, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorE       ColorF       ColorG       ColorH       ColorI  
##        5569        -1209         3592         3990         3100         2071  
##      ColorJ  
##       -3632  
```
]

---

## R is smart

.question[
What is the reference category?
]

.small[
```r
lm(TotalPrice ~ Color, data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ Color, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorE       ColorF       ColorG       ColorH       ColorI  
##        5569        -1209         3592         3990         3100         2071  
##      ColorJ  
##       -3632  
```
]

--

* What is the interpretation for Color `E` now?

--

* What if we wanted a different referent category?

--

* We could code the indicators ourselves

--

* We can change the reference level with the `relevel()` function

---

## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M512 64v256H128V64h384m16-64H112C85.5 0 64 21.5 64 48v288c0 26.5 21.5 48 48 48h416c26.5 0 48-21.5 48-48V48c0-26.5-21.5-48-48-48zm100 416H389.5c-3 0-5.5 2.1-5.9 5.1C381.2 436.3 368 448 352 448h-64c-16 0-29.2-11.7-31.6-26.9-.5-2.9-3-5.1-5.9-5.1H12c-6.6 0-12 5.4-12 12v36c0 26.5 21.5 48 48 48h544c26.5 0 48-21.5 48-48v-36c0-6.6-5.4-12-12-12z"/></svg> `Application Exercise`

Input the **design matrix**, `\(\mathbf{X}\)`, for the following data in R

y | age
---|----
2 | "<20"
3 | "20-50"
5 | ">50"
10 | "<20"
1 | "20-50"
3 | ">50"

Estimate the `\(\beta\)` coefficients for predicting y from age.
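
---

## Changing the reference category

The earlier question about picking a different referent category can be sketched as below. This is a minimal illustration, not the Diamonds analysis: the data frame `toy` and its columns `price`, `color`, and `color_j` are hypothetical names invented for the example. It relies on base R's `relevel()` and `model.matrix()`.

```r
# Hypothetical toy data (an assumption for illustration, not the Diamonds data)
toy <- data.frame(
  price = c(10, 12, 20, 22, 30, 32),
  color = factor(c("D", "D", "E", "E", "J", "J"))
)

# Default: the first factor level ("D") is the reference category
fit_default <- lm(price ~ color, data = toy)

# relevel() moves the chosen level to the front, making "J" the reference
toy$color_j <- relevel(toy$color, ref = "J")
fit_j <- lm(price ~ color_j, data = toy)

coef(fit_default)
coef(fit_j)

# Design matrix: one intercept column plus k - 1 indicator columns
X <- model.matrix(price ~ color_j, data = toy)
dim(X)
```

Both fits give the same predictions; only the category that each coefficient is compared against changes.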