13  GLM: Contrasts

Recall that in one-way ANOVA, we have a categorical predictor and a continuous outcome variable. The overall F-test of an ANOVA provides an omnibus (overall) test of differences between group means. The null hypothesis tested in this case is that all \(k\) groups have the same mean, \(H_0: \mu_1 = \mu_2 = \ldots = \mu_k\). Today, we delve deeper into tests you can perform when you have more complex hypotheses about group means, or want to perform follow-up tests to determine which groups differ significantly.

First, recall that we can use dummy variables to incorporate this categorical predictor in a regression model. We pick one reference category and create dummy variables to indicate membership in each of the other groups. These dummy variables take binary values (0 or 1) to represent group membership.

With five groups, for example, the formula for regression with dummies is:

\(\hat{Y}_i = a + b_1D_{1i}+ b_2D_{2i}+ b_3D_{3i}+ b_4D_{4i}\)

The intercept represents the mean value of the reference category, and the coefficients \(b\) represent the differences between each group and that reference category.
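For example, the model-implied mean of the second group is the intercept plus that group's coefficient, \(\hat{\mu}_2 = a + b_2\), so \(b_2 = \mu_2 - \mu_{\text{reference}}\).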

A different way to specify the same model is to omit the intercept, and include a dummy for each of the \(k\) groups, so we represent \(k\) groups with \(k\) dummies:

\(\hat{Y}_i = b_1D_{1i}+ b_2D_{2i}+ b_3D_{3i}+ b_4D_{4i}+ b_5D_{5i}\)

These two models are mathematically equivalent. The advantage of the second specification is that it directly estimates every group mean, providing a standard error for each, which allows us to test each group mean against a hypothesized value.

13.1 Effects Coding

Another way to include a categorical predictor is via effects coding. Effects coding compares each group to the grand mean. When we have unequal group sizes, the coding scheme should account for relative group size.

13.2 Contrast Coding

Contrast coding is yet another coding scheme; it compares groups of means. This allows us to test specific hypotheses about differences between (combinations of) groups. Contrast coding is a more advanced technique that requires some basic matrix algebra, for example in a spreadsheet program such as Excel.

13.3 ‘Post-Hoc’ Tests

The notion of post-hoc tests is a bit outdated; it essentially refers to making all possible pairwise comparisons between group means. The name reflects the fact that such tests are rarely hypothesized beforehand; they can be considered an exploratory procedure for finding differences between groups. Note that performing many tests inflates the risk of drawing false-positive conclusions (Type I error).

13.4 Adjusting for Multiple Comparisons

When conducting multiple tests, we face an increased risk of committing Type I errors (false positives). If we perform \(m\) tests within one study, the experiment-wise Type I error rate is \(1- (1-\alpha)^m\). To control the experiment-wise Type I error rate, we can apply a Bonferroni correction, which divides the significance level \(\alpha\) by the number of tests \(m\). This trades off fewer false positive results for more false negative results.
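For example, with \(\alpha = .05\) and \(m = 5\) tests, the experiment-wise Type I error rate is \(1 - (1 - .05)^5 \approx .23\), and the Bonferroni-corrected significance level for each individual test would be \(.05 / 5 = .01\).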

Planning to test specific hypotheses before conducting the study also helps control Type I error.

14 Lecture

15 Tutorial

15.1 Group Means

In this tutorial, you will learn to use the general linear model to estimate group means and to test differences between two group means, between individual group means and the overall mean, and between groups of means.

Open hiking_long.sav in SPSS.

The data file describes the result of a fictitious experiment. A hiking guide has displayed five different types of behavior towards different groups of hikers. The treatment that each person received from the guide is recorded in the variable behavior.

The dependent variable of this experiment is feeling. Higher scores on this variable indicate a more positive attitude of a participant towards the guide. In this assignment, we will use ANOVA to determine whether the mean score on the dependent variable differs between the five experimental conditions.

The data file contains a third variable named weather which can be either good or bad. For now, we will only look at the results obtained during good weather. Hence, we will use “Select cases” to select only those participants with a value of 1 on the weather variable.

Additionally, the data contains a variable named balanced which distinguishes between data resulting from a balanced experiment (with equal sample sizes in all groups), and from an unbalanced experiment (with unequal group sizes). For now, just ignore this variable.

Click Data > Select Cases and select “If condition is satisfied” and click the “If”-button. Now enter the following condition into the equation box:

weather = 1

Now click “Continue” and “Paste” to paste the resulting syntax into the syntax editor. Select Run > All to run it. You should now see in the Data View tab that half of the participants have been crossed out.
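
The pasted syntax should look roughly like the following (trimmed to its essential lines; SPSS also adds some labeling commands), assuming the default “Filter out unselected cases” setting:

USE ALL.
COMPUTE filter_$=(weather = 1).
FILTER BY filter_$.
EXECUTE.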

First, let’s compute the overall mean of feeling and tabulate the group means.

What is the overall mean of feeling?

What are the group means:

  • What is the mean of the rushing group?
  • What is the mean of the stories group?
  • What is the mean of the insulting group?

To get the mean of feeling, use Analyze -> Descriptive Statistics -> Descriptives.

To get the group means, use Analyze -> Compare Means -> Means.

DESCRIPTIVES VARIABLES=feeling 
  /STATISTICS=MEAN STDDEV MIN MAX.
  
MEANS TABLES=feeling BY behavior
  /CELLS=MEAN COUNT STDDEV.

You have previously learned to include categorical variables in a linear model by using dummy coding. Today, we will build upon this principle of encoding the information from a categorical variable into several numerical variables.

First, recall that a linear model with a five-group nominal predictor can be written as follows:

\(\hat{Y} = b_0 + b_1*D_1 + b_2*D_2 + b_3*D_3 + b_4*D_4\)

What is \(b_0\) in this equation?

To estimate the model above using regression, you could code dummy variables as follows:

behavior D1 D2 D3 D4
rushing 1 0 0 0
telling stories 0 1 0 0
insulting 0 0 1 0
making jokes 0 0 0 1
singing 0 0 0 0

What is the reference category in the coding scheme above?

Specify the dummies as described in the table, and estimate the model.

To create the dummy variables, use Transform -> Recode into Different Variables.

To estimate the model, use Analyze -> Regression -> Linear.

RECODE behavior (1=1) (2=0) (3=0) (4=0) (5=0) INTO rushing.
RECODE behavior (1=0) (2=1) (3=0) (4=0) (5=0) INTO stories.
RECODE behavior (1=0) (2=0) (3=1) (4=0) (5=0) INTO insulting.
RECODE behavior (1=0) (2=0) (3=0) (4=1) (5=0) INTO jokes.
EXECUTE.

  
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT feeling
  /METHOD=ENTER rushing stories insulting jokes.

What’s the value of the intercept?

In this analysis, the intercept is the mean value on feeling for the reference category (singing). Verify that this is true by comparing the intercept of this regression to the Means table you made previously.

What is the value of the coefficient for stories?

How can we interpret this coefficient?

15.2 More Dummies

As we’ve previously established, dummy variables allow us to test the significance of mean differences between one reference group and all other groups.

Now, imagine we expect rushing to have a negative effect on feeling, and we want to know which other behaviors are “better” (i.e., result in a higher score on feeling) than rushing.

Specify your hypotheses, then check your answer.

\(H_0: \mu_{rushing} \geq (\mu_{stories}, \mu_{insulting}, \mu_{joking}, \mu_{singing})\)

\(H_1: \mu_{rushing} < (\mu_{stories}, \mu_{insulting}, \mu_{joking}, \mu_{singing})\)

Use dummy variables to test this hypothesis. Note: you will need to specify one additional dummy.
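
If you have not yet created a dummy for the singing group, a RECODE along these lines should do the trick (assuming, as in the coding table of the first assignment, that category 5 of behavior is singing):

RECODE behavior (1=0) (2=0) (3=0) (4=0) (5=1) INTO singing.
EXECUTE.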

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT feeling
  /METHOD=ENTER stories insulting jokes singing.

What is the \(R^2\) of this model?

Compare this to the \(R^2\) of your previous model. They should be identical, as should the overall F-test and p-values. Changing the reference category doesn’t change what information the dummies convey.

Do we perform one-sided or two-sided tests?

Use this information to test your hypotheses.

Which behaviors are “better” than rushing?

You can use this technique any time you want to test the significance of the difference between one reference group and other groups.

Note that, in the first assignment, we used a different set of dummy variables than in the second assignment. This means that you can always use different sets of dummy variables when you want to compare against multiple reference groups.

15.3 Estimating Group Means

In the first assignment, we computed the group means. But remember that the ANOVA model allows us to estimate them using the general linear model. In this assignment, we will do so by hand. After the previous assignment, you should have five dummy variables to represent the five groups of behavior.

Until now, you’ve always included four dummy variables to represent five categories, as the last category is represented by the intercept.

However, it is also possible to represent five groups as follows:

\(\hat{Y} = b_1*D_1 + b_2*D_2 + b_3*D_3 + b_4*D_4 + b_5*D_5\)

What is \(b_5\) in this equation?

To estimate the model above using regression, you could code dummy variables as follows (note that you should already have all these dummies from the previous assignment):

behavior D1 D2 D3 D4 D5
rushing 1 0 0 0 0
telling stories 0 1 0 0 0
insulting 0 0 1 0 0
making jokes 0 0 0 1 0
singing 0 0 0 0 1

Now, go to Analyze -> Regression -> Linear, and add all five dummies as predictors.

Then, click the Options button, and notice the option titled “Include Constant in Equation”.

Turn this option off to remove the intercept from the regression equation, then paste your syntax. Notice a new line that says /ORIGIN instead of /NOORIGIN. This command removes the intercept.
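
For reference, the pasted syntax should look roughly like this (assuming your five dummies are named as in the previous assignments):

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /ORIGIN 
  /DEPENDENT feeling
  /METHOD=ENTER rushing stories insulting jokes singing.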

Run your syntax, and examine the results.

You might notice that the \(R^2\) and F-test have changed. This is because these statistics are computed relative to a null model containing only the intercept - but you told SPSS not to include an intercept, so it cannot compute that null model here. This is not a problem: as soon as you estimate models with an intercept again, the \(R^2\) values will be identical again, regardless of the dummy coding.

What is the value of the coefficient for the singing dummy?

Compare all coefficients to the table of means from the first assignment. They should all be identical.

This is how you estimate means using the linear model!

Now, what do the t-tests and p-values in the Coefficients table tell us?

Keep in mind that you can use the standard errors from the coefficients table to perform t-tests against values other than 0. For example, what would the test statistic be when testing whether the mean of the insulting group is significantly different from 5, so \(H_0: \mu_{insulting} = 5\)? t =
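(Hint: in general, the test statistic is the estimate minus the hypothesized value, divided by its standard error, \(t = \frac{b_{insulting} - 5}{SE_{b_{insulting}}}\), evaluated against a t-distribution with the residual degrees of freedom of the regression.)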

Is the difference significant?

You can use regression without an intercept any time you wish to estimate all group means in a single analysis and/or test the group means against specific hypothesized values.

15.4 Comparing to Overall Mean

Until now, we’ve represented levels of a categorical variable using dummy variables with values 0 or 1.

In this assignment, we introduce an alternative coding system: effects coding.

The main difference from dummy coding is that the reference category does not receive the value 0 on all indicator variables; instead, it receives a negative value. In a balanced design (with equal sizes for each group), this value is -1.

So in a balanced design, with equal group sizes, the coding scheme for effects coding is as follows (I now use the letters E1-E4 to clarify that these are not dummies but effect-coded indicators):

behavior E1 E2 E3 E4
rushing 1 0 0 0
telling stories 0 1 0 0
insulting 0 0 1 0
making jokes 0 0 0 1
singing -1 -1 -1 -1

The reference category is still “singing”.

The resulting model will give us the following information:

  • The overall sample mean for feeling
  • The difference between each group mean (except for singing) and the overall mean
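
In equation form, for the balanced case: the intercept is \(b_0 = \bar{Y}\) (the grand mean), and each coefficient is \(b_j = \bar{Y}_j - \bar{Y}\), the deviation of group \(j\) from that grand mean.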

Most of the time, however, we will not have a balanced design. With this in mind, it is more useful to learn the general way to construct effects coding.

Specifically, the weights assigned to the singing category (the reference category) differ for each indicator, and are computed as:

\(-1 * n_{\text{this category}} / n_{\text{reference category}}\)
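
For example, in a purely hypothetical design with 12 hikers in the rushing group and 10 in the singing (reference) group, the weight for singing on the rushing indicator would be \(-1 \times 12/10 = -1.2\). Use the actual group sizes from your own output below.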

Check the group sizes in the output from assignment 1.

What is the sample size for the rushing group?

What is the sample size for the reference category?

With this in mind, what should the weight be for the singing group on the indicator that codes for membership of the rushing group?

Complete the following syntax, then run it:

RECODE behavior (1=1) (2=0) (3=0) (4=0) (5= …) INTO Erushing.
RECODE behavior (1=0) (2=1) (3=0) (4=0) (5= …) INTO Estories.
RECODE behavior (1=0) (2=0) (3=1) (4=0) (5= …) INTO Einsulting.
RECODE behavior (1=0) (2=0) (3=0) (4=1) (5= …) INTO Ejokes.
EXECUTE.

Note the correct answer for the effect code for the stories group. Compare the number of people in the stories group with the number in the reference group, and recall that in a balanced design (with equal group sizes) this value is -1. You can now see that this is true, and why.

Calculate the effect indicators, then specify a regression model with these four effect indicators. Make sure to include the intercept again!


RECODE behavior (1=1) (2=0) (3=0) (4=0) (5= -1.18) INTO Erushing.
RECODE behavior (1=0) (2=1) (3=0) (4=0) (5=-1) INTO Estories.
RECODE behavior (1=0) (2=0) (3=1) (4=0) (5=-1.18) INTO Einsulting.
RECODE behavior (1=0) (2=0) (3=0) (4=1) (5=-1.18) INTO Ejokes.
EXECUTE.


REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT feeling
  /METHOD=ENTER Erushing Estories Einsulting Ejokes.

Run the syntax. What is the F-value of the model?

Verify that this is the same value you got before in models with an intercept.

What is the value of the intercept?

Verify that this is identical to the overall mean of the dependent variable.

Which group means differ significantly from the overall mean?

Using the coefficients table, calculate the mean of the jokes group. What value do you get?

This should be identical to the mean you observed in the previous assignment (using regression to estimate means), and in the first assignment (just computing the means).

15.5 Comparing Groups of Means

Extending the methods above, it is also possible to compare groups of means. For example, we might wonder whether negative behaviors (rushing and insulting) differ significantly from positive behaviors (stories, jokes, and singing).

This approach builds upon the logic of effects coding, where the weights for the reference category were based on its relative sample size. This time, however, the weights for the categories being compared to the reference category also depend on the sample sizes.

We are going to perform several steps, as explained in the lecture.

15.5.1 Step 1: Plan Contrasts

Keep in mind these rules:

  1. The values of each contrast variable must sum to 0 across the groups.
  2. Each group must be uniquely identified by a particular combination of the contrast variables.

Assume for a moment that we have equal group sizes and want to compare groups 1 and 2 to groups 3, 4, and 5.

Appropriate contrasts would then be (I’m using the letter C to indicate that these are not dummies or effect indicators):

behavior C1 C2 C3 C4
rushing 1 1 0 0
insulting 1 -1 0 0
telling stories -\(\frac{2}{3}\) 0 2 0
making jokes -\(\frac{2}{3}\) 0 -1 1
singing -\(\frac{2}{3}\) 0 -1 -1

Note that:

  • Each column sums to 0
  • Every level of behavior is uniquely identified by some combination of contrasts

In this case, we only care about C1; we created C2, C3 and C4 to ensure that every level of behavior is uniquely identified. But what do C2-C4 test?

C2 compares the two negative behaviors; C3 compares stories against jokes and singing. C4 compares jokes and singing.

15.5.2 Step 2: Account for Group Size

Now, we have to account for the relative sample sizes of these groups to ensure that we can interpret the coefficients as the difference between the means of those combinations of groups.

Use the descriptive statistics you previously obtained to weight the contrasts from step 1.

E.g., contrast C3 below is already completed. Which other contrasts do you still need to change?

behavior C1 C2 C3 C4
rushing 1 1 0 0
insulting 1 -1 0 0
telling stories -\(\frac{2}{3}\) 0 1 0
making jokes -\(\frac{2}{3}\) 0 -\(\frac{13}{11+13}\) 1
singing -\(\frac{2}{3}\) 0 -\(\frac{11}{11+13}\) -1

15.5.3 Step 3: Do Matrix Algebra

Enter the complete matrix into a spreadsheet program. Add one column before the contrasts with an intercept for each group, equal to \(1/k\).

What’s the value of this intercept for this study?

  • Click an Empty cell
  • Paste =MINVERSE(TRANSPOSE(
  • Select your contrast matrix
  • Finish the formula by typing the closing brackets )) and press Enter (in older versions of Excel, you may need to select a 5-by-5 output range first and confirm with Ctrl+Shift+Enter to enter it as an array formula)

These are the values you will use for your indicators!
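
If you prefer to stay inside SPSS rather than use a spreadsheet, the same computation can be sketched with the MATRIX procedure. The example below uses the balanced-design contrasts from Step 1 purely as an illustration (with \(-\frac{2}{3}\) rounded to -0.667); substitute the size-weighted matrix you built in Step 2.

* Rows of M = the five behavior groups; columns = intercept (0.2) plus contrasts C1 to C4.
MATRIX.
COMPUTE M = {0.2,  1.000,  1,  0,  0;
             0.2,  1.000, -1,  0,  0;
             0.2, -0.667,  0,  2,  0;
             0.2, -0.667,  0, -1,  1;
             0.2, -0.667,  0, -1, -1}.
COMPUTE codes = INV(TRANSPOS(M)).
PRINT codes /FORMAT = F8.3.
END MATRIX.

The first column of the result should be a column of 1s (the constant); the remaining columns, one row per group, give the values to use for your contrast indicators.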

Now, write syntax to create the contrasts using the values you calculated in a spreadsheet. Give these contrasts informative names to help remind yourself of their interpretation. Here is one example; complete the rest yourself:

RECODE behavior (1=.6) (2=.6) (3=-.4) (4=-.4) (5= -.4) INTO posVneg.
RECODE behavior (1=.5) (2=-.5) (3=0) (4=0) (5= 0) INTO rushVinsult.
RECODE behavior (1=-.01) (2=-.01) (3=.67) (4=-.33) (5= -.33) INTO storyVjokesing.
RECODE behavior (1=.02) (2=.02) (3=.02) (4=.48) (5= -.53) INTO ? .
EXECUTE.

What does the final contrast encode?

15.6 Run the Analysis

Create the indicator variables and run the regression analysis.
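
Your syntax should look something like the sketch below, where contrast4 is a placeholder for whatever name you gave the final contrast:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT feeling
  /METHOD=ENTER posVneg rushVinsult storyVjokesing contrast4.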

What is the mean difference in feeling between rushing and insulting behaviors?

Which effects are significant?

15.7 Adjusting for Multiple Comparisons

In these assignments, we have been conducting many tests. You have learned that the significance level \(\alpha\) indicates the probability of drawing a false-positive conclusion (Type I error). However, this risk compounds across multiple tests! So when you perform many tests, you can end up in a situation where the probability of committing at least one Type I error is very high.

We call the total probability of committing at least one Type I error across multiple tests in the same study the “family-wise” or experiment-wise Type I error. You compute it as:

\(P(\text{at least one Type I error}) = 1 - (1 - \alpha)^{\text{number of tests}}\)

So if we perform 3 comparisons, the probability of committing at least one Type I error is:

And if we perform 10 tests?

If this makes you uncomfortable - you’re not alone! People often seek to maintain a low risk of drawing any false-positive conclusions, and we can do so simply by lowering \(\alpha\).

15.7.1 Bonferroni correction

Bonferroni proposed a simple correction of \(\alpha = \alpha_{EW}/m\), where \(\alpha_{EW}\) is the desired experiment-wise Type I error rate (e.g., .05), and \(m\) is the number of tests.

What alpha level would you use per test if you want to achieve an experiment-wise alpha of .05 and conduct 7 tests?

The Bonferroni correction is quite conservative; in other words, although Bonferroni helps you avoid false-positive conclusions, it also makes it much harder to detect true effects.

15.8 Compare All Groups

Through the ANOVA interface, you can compare all groups to one another. This is equivalent to repeating a regression analysis multiple times, making each category the reference category in turn.

Go to Analyze -> Compare Means -> One-Way ANOVA. Enter feeling as the dependent variable and behavior as the factor.

Now, click Post Hoc. Note that you can select many different tests. Select LSD; this corresponds to “normal”, unadjusted p-values.
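
If you paste the dialog, the resulting syntax should look roughly like this:

ONEWAY feeling BY behavior
  /MISSING ANALYSIS
  /POSTHOC=LSD ALPHA(0.05).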

The other tests in this menu will either apply a penalty to the p-value, or compute the test statistic in a different way, with the purpose of adjusting for multiple comparisons.

We will apply the correction for multiple comparisons manually instead, because letting SPSS perform the correction behind the scenes carries a high risk of user error.

Assuming we perform two-sided tests at \(\alpha = .05\), how many significant differences between group means are there?

Now, apply a Bonferroni correction to the alpha level. How many tests are you performing?

What is the new alpha level?

How many comparisons are still significant when using this new alpha level?