David Gerbing
gerbing@pdx.edu
This discussion expands and complements the material in Section 10.3, Indicator Variables, from my book R Data Analysis without Programming.
Begin by providing access to the lessR functions.
library(lessR)
##
## lessR 3.4.9 feedback: gerbing@pdx.edu web: lessRstats.com
## -------------------------------------------------------------------
## 1. Read a text, Excel, SPSS, SAS or R data file from your computer
## into the mydata data table: mydata <- Read()
## 2. For a list of help topics and functions, enter: Help()
## 3. Use theme function for global settings, Ex: theme(colors="gray")
Read the data into the mydata data table, and then list at the console.
mydata <- Read("http://web.pdx.edu/~gerbing/seminar/data/anovaReg.csv", quiet=TRUE)
mydata
## X Y
## 1 A 0
## 2 A 2
## 3 A 4
## 4 B 2
## 5 B 4
## 6 B 6
## 7 C 4
## 8 C 6
## 9 C 8
## 10 D 6
## 11 D 8
## 12 D 10
The data indicate a design of a one-way ANOVA with grouping variable X and response variable Y. The categorical variable X, or factor, has 4 levels (groups): A, B, C and D. There are 3 data values per cell. The key aspect of this data is that is structured by discrete groups, defined by a non-numeric categorical variable, X.
Run the usual one-way analysis of variance with the response variable Y and grouping variable X.
For our purposes here, we do not need to see the entire output of the ANOVA analysis. The output of (most) lessR functions can be saved to an object, and then only desired parts of the object need be displayed. Here save the output of the ANOVA function into the object a. The names of all the components of the output can be obtained from names(a). The names of the text components of output start with out_, and the relevant component here is out_descriptive. Also turn all graphics off as we do not need to see them either.
a <- ANOVA(Y ~ X, graphics=FALSE)
a$out_descriptive
## n mean sd min max
## A 3 2.00 2.00 0.00 4.00
## B 3 4.00 2.00 2.00 6.00
## C 3 6.00 2.00 4.00 8.00
## D 3 8.00 2.00 6.00 10.00
##
## Grand Mean: 5
We observe the following cell means: \(m_A=2, m_B=4, m_C=6, m_D=8\).
Now compare the ANOVA output to that of Regression, here using the abbreviation reg, and again saving to an object, and here displaying only the estimates.
r <- reg(Y ~ X, graphics=FALSE)
r$out_estimates
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 2.00 1.15 1.732 0.122 -0.66 4.66
## XB 2.00 1.63 1.225 0.256 -1.77 5.77
## XC 4.00 1.63 2.449 0.040 0.23 7.77
## XD 6.00 1.63 3.674 0.006 2.23 9.77
The grouping variable X is categorical, and, of course, a regression requires numerical variables specified in the model. What R does is convert the categorical variable X into what are called indicator variables, one for each category (level) except the first group. This is shown explicitly later, but first let’s examine the regression output.
R refers to the first group listed as the reference group. The regression coefficients are defined in relation to the reference group. Specifically, the intercept is the mean of the reference group, and the slope coefficients are the mean of the corresponding group minus the mean of the reference group. These comparisons are called contrasts or effects.
\[ \begin{aligned} \hat Y &= m_{A} + (m_{B}-m_{A})X_{A} + (m_{C}-m_{A})X_{C} + (m_{D}-m_{A})X_{D}\\ &= 2 + (4 - 2)X_{A} + (6 - 2)X_{C} + (8 - 2)X_{D} \end{aligned} \]
R defined the level A as the reference group not because the data values of A are listed first in the data table. Instead, R orders the levels of a factor, by default, alphabetically. If the categories were Male and Female, then R would set the Female category as the reference group.
What goes on to convert the non-numeric categories into numerical variables? The answer is the contrast matrix, which specifies the corresponding indicator variables that indicate the underlying contrasts to be tested. For such a simple problem the contrast matrix easily can be entered manually, but let’s use the R contrasts function to construct the matrix. By default the function constructs contrasts that correspond to what R did for the one-way ANOVA via regression. After constructing the matrix, convert the matrix to a data frame. The contrast matrix is stored in the object dummy because the type of resulting contrasts are called _dummy contrasts__.
dummy <- contrasts(mydata$X)
dummy <- data.frame(dummy)
dummy
## B C D
## A 0 0 0
## B 1 0 0
## C 0 1 0
## D 0 0 1
The name of each indicator variable is the name of the corresponding group. By examining the placement of any 1 in each row of the data table, the corresponding category can be identified. The reference group, Group A, does not have any 1 for any of the three indicator variables. The name of the first indicator variable is the name of the second group, B, and the values of indicator variable B is 1 just for each row of data for that group.
We wish to merge the contrast matrix by columns with the original data table. To do that we need a variable in common, which would the level of the grouping variable. However, the contrast matrix has the name of each level as the corresponding row name of the table instead of as an actual variable. So make the 4th column of the contrast table the corresponding row name. The R cbind function binds two sets of data together by column.
dummy <- cbind(dummy, row.names(dummy))
names(dummy)[4] <- "X"
dummy
## B C D X
## A 0 0 0 A
## B 1 0 0 B
## C 0 1 0 C
## D 0 0 1 D
Now can merge the data tables by columns. Do what is called a left join. Add to all rows of the first listed data frame the matching columns from the second listed data frame. Here list mydata first and dummy second.
mydata <- Merge(mydata, dummy, by="X", quiet=TRUE)
mydata
## X Y B C D
## 1 A 0 0 0 0
## 2 A 2 0 0 0
## 3 A 4 0 0 0
## 4 B 2 1 0 0
## 5 B 4 1 0 0
## 6 B 6 1 0 0
## 7 C 4 0 1 0
## 8 C 6 0 1 0
## 9 C 8 0 1 0
## 10 D 6 0 0 1
## 11 D 8 0 0 1
## 12 D 10 0 0 1
Now we can run the multiple regression of Y directly onto the indicator variables, which is what R did in the first example, creating the contrast matrix for us.
r <- reg(Y ~ B + C + D, graphics=FALSE)
r$out_estimates
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 2.00 1.15 1.732 0.122 -0.66 4.66
## B 2.00 1.63 1.225 0.256 -1.77 5.77
## C 4.00 1.63 2.449 0.040 0.23 7.77
## D 6.00 1.63 3.674 0.006 2.23 9.77
The output is identical to the previous regression analysis, except that when R created the contrast matrix automatically, it labeled the variables by concatenating the variable name, X, with the group name: B, C or D. For this second analysis, only the group name is displayed because that is the name of the corresponding indicator variables from the contrast function, though those could be changed with the R names function.
The concept of contrast matrix can be specified many, many different ways, even for this one-way ANOVA. Custom contrasts can be designed to test any set of contrasts desired, not just for one-way designs, but for designs of any complexity, including interaction and nested terms.