7  GLM-III: Binary Predictors

We can examine group differences in a continuous outcome variable using bivariate regression. To do this, group membership must be represented as a binary variable (e.g., gender or ethnicity). To ensure meaningful results, we use dummy coding to represent the binary variable. Dummy coding assigns the value 0 to one category, which serves as the reference category, and the value 1 to the other category. When we include this dummy variable as the predictor in a bivariate linear regression analysis, it will estimate the mean value of the reference category and test the difference between the means of the two categories.

Regression with a binary predictor is completely equivalent to the independent samples t-test. The independent samples t-test is also used to compare the means of two independent groups. In regression, we estimate the slope (b) for the binary predictor, which represents the difference between the means of the two groups. This t-test of the slope in regression is the same as an independent samples t-test.

Both regression with a binary predictor and the independent samples t-test rely on certain assumptions. These include the linearity of the relationship between the binary predictor and the outcome variable, the normality of residuals (the outcome variable should be normally distributed within each group), homoscedasticity (equal variances in both groups), and independence of observations. To check for equal variances, we can use Levene’s test - but keep in mind that “assumption checks” are questionable. If you assume equal variances, report the normal t-test; if you do not assume equal variances, you can report a corrected t-test that allows for different variances. Both are included in SPSS output by default.

To determine the practical significance of a mean difference, we can calculate an effect size measure. Cohen’s d is a commonly used effect size for mean differences. It standardizes the difference between the two group means by the pooled standard deviation. A larger Cohen’s d indicates a greater magnitude of difference between the groups. As a rule of thumb, a small effect size is typically considered around d ≈ 0.2, a medium effect size around d ≈ 0.5, and a large effect size around d ≈ 0.8.

8 Lecture

9 Formative Test

A formative test helps you assess your progress in the course, and helps you address any blind spots in your understanding of the material. If you get a question wrong, you will receive a hint on how to improve your understanding of the material.

Complete the formative test ideally after you’ve seen the lecture, but before the lecture meeting in which we can discuss any topics that need more attention

Question 1

What is the purpose of dummy coding in regression with binary predictors?

Question 2

What does the slope coefficient (b) represent in regression with a binary predictor?

Question 3

Which assumption is not relevant for the independent samples t-test and bivariate linear regression with only a binary predictor?

Question 4

What does the Levene’s test check in the context of the independent samples t-test?

Question 5

How is the independent samples t-test related to the t-test of the slope in regression with a binary predictor?

Question 6

What does the p-value in the context of the independent samples t-test indicate?

Question 7

What does Cohen’s d represent?

Question 8

What is the recommended approach when assumption checks for homoscedasticity are significant?

Question 9

The observed mean difference between two groups is 2.50, and Cohen’s D is 1.25. What is the pooled standard deviation?

Question 10

A researcher accidentally coded a dummy variable as 0 and 2, instead of 0 and 1. The regression equation is Y = 5.66 + 3*D. What is the mean value of the group coded 2?

Question 1

Dummy coding allows regression to include binary predictors by assigning numerical values to each category, estimating the mean of the reference category, and testing the difference between categories.

Question 2

The slope coefficient (b) in regression with a binary predictor represents the difference in means between the two categories, indicating how much the dependent variable changes when the binary predictor changes from 0 to 1.

Question 3

The assumption of linearity is not relevant, because the difference between two binary values of the predictor is linear by definition.

Question 4

Levene’s test checks the assumption of equality of variances in both groups for the independent samples t-test.

Question 5

The independent samples t-test and the t-test of the slope in regression with a binary predictor are equivalent tests that compare means between two independent groups.

Question 6

The p-value indicates the probability of observing a group difference at least as extreme as the one observed, assuming that the null hypothesis is true.

Question 7

Cohen’s d is an effect size that standardizes the difference between group means by the (pooled) standard deviation, making it interpretable on a meaningful scale.

Question 8

Assumption checks can alert you that important assumptions of the test are violated, but you should not blindly adapt analyses based on their results either - particularly in confirmatory research. You can always perform a sensitivity analysis in which you report both the planned analysis and the robust version.

Question 9

Cohen’s D = mean difference/pooled sd.

Question 10

The slope tells you how much the predicted value goes up for a 1-unit increase in the predictor D. Since D is coded 0 and 2, a 1-unit increase only gets you halfway!

10 In SPSS

10.1 Independent Samples t-test

As a t-test and as regression with a dummy predictor:

11 Tutorial

11.1 Independent Samples T-Test

In this assignment we will use the data file 5groups.sav Download 5groups.sav. Download the file and open it in SPSS.

This time, we will compare the means of the variable y of two specific groups: group 1 and group 4. To test the difference between two sample means, we will use the t-test for independent samples.

What is the null hypothesis of this test? And what is the alternative hypothesis?

\(H_0: \mu_1=\mu_4\), against \(H_1: \mu_1\neq \mu_4\)

Create the necessary syntax for the t-test that compares the means of group 1 and group 4.

You can find the dialog for the two-sample t-test under Analyze > Compare Means > Independent Samples T Test

In the SPSS dialog you have to specify which two groups you want to compare. In our case, it’s group 1 and group 4. After placing the variable in the box named “Grouping Variable”, click the button named “Define Groups” to define the groups.

Compare your syntax to the correct syntax:

T-TEST GROUPS=group(1 4) /MISSING=ANALYSIS /VARIABLES=y /CRITERIA=CI(.95).

One of the assumptions of the independent samples t-test is homoscedasticity (equal variances for all levels of the predictor). We can compare the sizes of the variances of the two groups with a simple F-test, which we call Levene’s test.

Have a look at Levene’s test and try to interpret it. Discuss with your group what null-hypothesis is being tested here.

What is the p-value of the Levene’s test?

What do you conclude from this? What’s the practical use of the outcome of this test?

Levene’s test is not significant. Remember that the null hypothesis of Levene’s test is that the population variances of the group are equal. As the p-value is not significant, we cannot reject the null hypothesis. Consequently, there is no evidence that the population variances of two groups are unequal. Thus, there is no reason to question the assumption.

Now you will have to decide on the outcome of the actual t-test. SPSS reports two versions: one that assumes equal variances (top row) and one that relaxes this assumption (bottom row).

You should pick one of these. In principle, you should decide which one you will use before seeing the results - although if there is clear evidence of violation of assumptions, you might want to discuss in your report whether the results change if you use the robust version (bottom row).

For now remember: we assume equal variances.

What is the two-sided p-value?

Do you reject the null hypothesis of this t-test at alpha 0.05?

11.2 Regression with dummies

We will now perform the exact same analysis, but with regression and dummies.

To test the difference between group 1 and group 4, we first create a dummy variable to distinguish these two groups. Use group 1 as reference category. You can use either Transform -> Recode into different variables, or syntax:

RECODE group (1=0) (4=1) INTO dgroup4.
EXECUTE.

Note that all other groups are coded as missing on this variable, which is exactly what we want!

We will use regression to perform our t-test. The hypothesis is the same as in the previous assignment, but you could also rewrite it in terms of regression coefficient(s). What is the null hypothesis of this test in terms of regression coefficient(s)? And what is the alternative hypothesis?

\(H_0: \beta_{group1 vs group2}=0\) which is the same as \(H_0: \mu_{group1} = \mu_{group2}\), versus \(H_1: \beta_{group1 vs group2} \neq 0\) which is the same as \(H_0: \mu_{group1} \ne \mu_{group2}\)

Create the necessary syntax for a regression with the dummy variable that compares the means of group 1 and group 4.

You can find the dialog under Analyze > Regression > Linear

In the SPSS dialog you have to specify the Dependent and Independent variable. In our case, the independent variable is the dummy we created.

Compare your syntax to the correct syntax:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT y
  /METHOD=ENTER dgroup4.

Note that, unlike the t-test interface, the regression interface does not provide a Levene’s test. This is one reason you might want to use the t-test interface. The regression interface provides a more generic way to test the assumption of homoscedasticity: a residual plot.

Go back through the regression interface, but this time click the Plots button and plot the predicted value (X = ZPRED) against the residual value (Y = ZRESID).

Your syntax will now say:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT y
  /METHOD=ENTER dgroup4
  /SCATTERPLOT=(*ZRESID ,*ZPRED)

If the assumption of homoscedasticity is met, we should see that the dots in this plot are equally distributed around the zero line for all values on the X-axis. In this case, we see much narrower spread on the right side than on the left side.

What can you conclude from this, and does it match your conclusion from Levene’s test?

Now you will have to decide on the outcome of the actual t-test.

Remember that the t-test of the dummy variable should be the same as the t-test we conducted before. Verify that this is true.

What is the two-sided p-value?

We see one more t-test: for the “(Constant)” or intercept. How do we interpret this?