6  GLM-II: Sums of Squares

Last week we discussed how linear regression represents the relationship between a predictor variable (X) and an outcome variable (Y) as a straight line. This line has some prediction error for each individual data point. The regression line, by definition, is the line with the smallest possible overall prediction error across all participants. Today, we explore this concept of “smallest possible overall prediction error” in more detail.

The sum of prediction errors across all participants is always zero because the regression line passes through the “middle” of the data: the negative and positive prediction errors exactly balance out and cancel each other. To calculate a meaningful “total prediction error”, we therefore square the prediction errors, which makes all values non-negative so that they add up to a positive total. We call this sum the “sum of squared errors” (SSE). When we estimate a regression model in statistical software, we ask it to find the regression line that minimizes the SSE, that is, the line with the smallest possible prediction errors. For bivariate linear regression, this line can be calculated directly using matrix algebra (outside the scope of this course); this estimation method is called “ordinary least squares” (OLS).
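In symbols, writing \(y_i\) for the observed outcome of person \(i\) and \(\hat{y}_i = b_0 + b_1 x_i\) for the value the regression line predicts for that person:

\[
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]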

Now that we know the total amount of prediction error (SSE), we also have a basic measure of goodness of fit for the regression line. However, SSE is not readily interpretable because it lacks a meaningful scale. To assess the goodness of fit relative to a baseline, we compare the SSE of the regression line to the sum of squares we would obtain if we did not use the predictor variable at all, that is, if we just predicted the mean value of Y for each individual. A model with only the mean and no predictor variables is called the null model. The sum of squared distances between individual observations and the mean is referred to as the Total Sum of Squares (TSS), which represents the total squared distance of individual observations from the mean of Y.
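In symbols, with \(\bar{y}\) denoting the mean of Y:

\[
TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2
\]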

To understand how much of the TSS is explained by the regression line, we calculate the Regression Sum of Squares (RSS). This is the difference between the TSS and the SSE: the reduction in TSS achieved by using the regression line to predict observations instead of just the mean. It indicates how well the regression line explains the variance in the dependent variable.
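In symbols:

\[
RSS = TSS - SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
\]

The second equality holds for any line estimated with ordinary least squares (with an intercept): the squared distances between the predicted values and the mean add up exactly to the reduction in prediction error.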

We can standardize the RSS by dividing it by the TSS, which gives us the “explained variance” \(R^2\), which ranges from 0 to 1. A higher \(R^2\) value indicates that a larger proportion of the total variance in the dependent variable is accounted for by the predictor variable. In other words, explained variance is the proportion of the total sum of squares (TSS) that is explained by the regression line (RSS).
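Putting the pieces together:

\[
R^2 = \frac{RSS}{TSS} = 1 - \frac{SSE}{TSS}
\]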

Understanding these sums of squares gives us a good foundation for understanding another statistic: the correlation coefficient \(r\). The correlation coefficient describes the strength and direction of the linear relationship between two variables. It differs from regression because regression describes one of these variables as an outcome of the other: an asymmetrical relationship. Correlation instead just describes how strongly the two variables are associated, without labeling one as the predictor and the other as the outcome: a symmetrical relationship. The correlation coefficient is a standardized measure of the strength of this association that ranges from -1 (perfectly negatively associated) to 1 (perfectly positively associated). A correlation coefficient of 0 means that there is no linear association between X and Y. Correlation and regression are very closely related: the squared correlation coefficient, \(r^2\), is the same as the measure of explained variance from simple linear regression, \(R^2\), and the correlation coefficient itself is the same as the standardized regression coefficient in bivariate regression.
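For reference, the correlation coefficient standardizes the co-variation of X and Y by the variation of each variable separately:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]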

7 Lecture

8 Formative Test

A formative test helps you assess your progress in the course, and helps you address any blind spots in your understanding of the material. If you get a question wrong, you will receive a hint on how to improve your understanding of the material.

Complete the formative test ideally after you’ve seen the lecture, but before the lecture meeting, in which we can discuss any topics that need more attention.

Question 1

What does SSE stand for in linear regression?

Question 2

Which of the following describes the concept of Ordinary Least Squares?

Question 3

What does the Regression Sum of Squares (RSS) represent?

Question 4

What is the purpose of the Total Sum of Squares (TSS) in linear regression?

Question 5

What does the term ‘explained variance’ refer to in linear regression?

Question 6

How is correlation different from regression?

Question 7

What is the range of the correlation coefficient (r) between two variables?

Question 8

Which statistic is equivalent to the standardized regression coefficient in bivariate regression?

Question 9

What is the relationship between the Total Sum of Squares (TSS) and the Regression Sum of Squares (RSS)?

Question 10

How is the proportion of explained variance (R²) related to the correlation coefficient (r)?

Question 1

SSE stands for Sum of Squared Errors, which represents the sum of the squared differences between actual and predicted values.

Question 2

Ordinary Least Squares aims to minimize the total prediction error, which is achieved by finding the regression line.

Question 3

RSS is the reduction in sum of squares that occurs when using the regression line to predict observations instead of just the mean.

Question 4

TSS measures the total variability of the dependent variable around its mean.

Question 5

Explained variance refers to the proportion of the total variance in the dependent variable that is explained by the predictor.

Question 6

Correlation quantifies the strength and direction of the linear relationship between two variables, while regression aims to predict one variable from another using the regression line.

Question 7

The correlation coefficient (r) ranges from -1 to 1, where -1 represents a perfect negative association, 0 represents no association, and 1 represents a perfect positive association.

Question 8

In bivariate linear regression, the standardized regression coefficient is equivalent to the correlation coefficient (r) between the predictor and the outcome.

Question 9

The relationship between TSS and RSS is given by TSS - RSS = SSE, indicating that the difference between TSS and RSS accounts for the error sum of squares.

Question 10

The proportion of explained variance (R²) is equal to the square of the correlation coefficient (r), meaning R² = r².

9 In SPSS

9.1 Correlation Analysis

10 Tutorial

10.1 Bivariate Regression

Social science students were asked about their opinion of Tilburg’s nightlife, their number of Facebook friends, and some other characteristics. The data are in the SocScSurvey.sav file.

Download the file to your computer and open it in SPSS.

Suppose researchers are interested in the relationship between personality and social media use. In particular, they want to know if extraversion explains the number of Facebook friends.

What is the independent variable here?

Remember that the independent variable is the variable that predicts the other variable (which we call the dependent variable). The dependent variable is influenced by the independent variable (its value depends on the independent variable).

Run a regression analysis in which you regress Facebook friends on extraversion (via Analyze > Regression > Linear).
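If you prefer syntax over the menus, the pasted command will look roughly like the sketch below. Note that the variable names friends and extraversion are assumptions; check the Variable View for the actual names in SocScSurvey.sav.

```
* Sketch: regress Facebook friends on extraversion (assumed variable names).
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT friends
  /METHOD=ENTER extraversion.
```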

Keep in mind that we “regress the dependent variable Y on the independent variables (X)”.

Consult the output.

Write down the estimated unstandardized regression equation.

\(\text{Friends}_i = -62.377 + 26.788 \times \text{Extraversion}_i + e_i\)

Which of the following statements is true?

Remember that the general form of interpretation of the unstandardized effect is: “If X increases by 1 unit, Y increases/decreases by ‘unstandardized regression coefficient’ units”.

What is the value of the standardized regression coefficient?

You can find the standardized regression coefficients in the column called ‘Standardized Coefficients Beta’.

Which of the following statements about the standardized regression coefficients is correct?

Remember that standardized regression coefficients are interpreted in a similar way as unstandardized regression coefficients are, with the one difference being they are interpreted in terms of standard deviations.

Consider the first person in the data file. The person had an extraversion score of 9.

What is the predicted number of Facebook friends for this person?

Consider the first person again.

Given the predicted number of Facebook friends for this person, what is the prediction error (rounded to the nearest integer)?

\(\text{Prediction error} = Y_{\text{observed}} - Y_{\text{predicted}}\)

Consider two people, one with an extraversion score of 10 and the other with an extraversion score of 15.

What is the difference in the predicted number of Facebook friends between the two persons? (report the absolute value)

Consult the output of the regression analysis.

What percentage of the total variance in number of Facebook friends can be explained by extraversion?

Consult the ANOVA table.

The table shows the results of an F-test.

What is the default null hypothesis and alternative hypothesis for the reported test?

\(H_0: \rho^2 = 0\), \(H_A: \rho^2 > 0\)

Suppose three researchers test the significance of the R-square.

Researcher I tests at the 10% level, researcher II tests at the 5% level, and researcher III at the 1% level.

Which researcher will reject the null hypothesis?

When reporting the F-test for the model, you would report \(R^2\), the F-test statistic, its degrees of freedom, and the p-value.

The F-test has two distinct degrees of freedom. The first refers to the degrees of freedom for the regression equation, and the second to the degrees of freedom for the residuals. The degrees of freedom are given in brackets. For example, if regression has 2 degrees of freedom and the residuals 100, we write the F-value as F(2,100) = ….

Which of the following F value and corresponding degrees of freedom should be reported?

10.2 Correlation

Correlations and regression analyses can both be used to study the relationship between variables, but there is an important difference.

Discuss with your group mates what the similarities and differences between the two methods are.

A correlation is a symmetric measure of association, meaning we are agnostic about which variable is the predictor and which is the outcome (or neither is a predictor or an outcome). The correlation between X and Y is the same as the correlation between Y and X.

In regression analysis, we do define an independent and dependent variable. The goal is to predict the outcome using the predictor. Most of the time, this implies an assumption of causality - but not necessarily.

For example, we can use regression to predict sales based on customer characteristics without assuming that those characteristics CAUSE sales. But if we want to cause an increase in sales, and we look at the regression coefficients to decide where to intervene - then it suddenly matters a lot whether the predictors are causes of sales or not.

You see this a lot in online marketing, when you receive many ads for a product that you recently bought. The regression model knows that looking at the product page is a great predictor of the intention to buy it, but it does not know that the reason you were looking at that page is that you were already buying it.

Now, let’s have a look at the correlation between these two variables.

Analyze > Correlate > Bivariate.

Choose as variables: Facebook Friends and Extraversion, and click OK.
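Equivalently, in syntax (again assuming the variable names friends and extraversion):

```
* Sketch: bivariate (Pearson) correlation between the two variables.
CORRELATIONS
  /VARIABLES=friends extraversion
  /PRINT=TWOTAIL NOSIG.
```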

What is the correlation between Extraversion and number of Facebook friends?

Suppose three researchers test the significance of the correlation between Extraversion and Facebook friends. Researcher I tests at the 10% level, researcher II tests at the 5% level, and researcher III at the 1% level.

Which researcher will reject the null hypothesis?

Which of the following interpretations is true?

Compare the correlation coefficient to the standardized regression coefficient from the bivariate regression you conducted previously.

Then, compare it to the value labeled “R” in the “Model Summary” table from the regression.

Square the correlation, and compare it to the value labeled “R Square”.

What do you observe?

If you did everything correctly, you should observe that the bivariate correlation is identical to the standardized regression coefficient. This is only the case with bivariate regression.

Furthermore, the bivariate correlation should be identical to the R reported in the Model Summary table, because in bivariate regression they are both just the correlation coefficient. R Square is the squared correlation coefficient, and we interpret it as the “proportion of variance in the outcome explained by the predictor”. Only in bivariate regression is it identical to the squared correlation between X and Y.

10.3 R squared

The R squared expresses how well the predictors explain variance in the outcome of a regression. In the next few steps we will look in more detail at this concept.

Consider the results of the regression model again.

Write down the (unstandardized) regression equation based on your previous results, and use the raw data in the Data View to answer the following question.

\(\text{pressure}_i = 37.863 - 0.320 \times \text{variety}_i + e_i\)

What is the predicted value (Y’) for emotional pressure at work for the first person in the data file (i.e., the person with respondent number 1)?

What is the prediction error (a.k.a. the residual) for the first person?

Remember: \(\text{Residual} = Y_{\text{observed}} - Y_{\text{predicted}}\)

We’ve just computed the predicted value and error by hand. It would be very tedious if we had to do that for all respondents. Fortunately, SPSS offers the option to compute predicted values and errors for all cases for us!

Navigate to Analyze > Regression > Linear

Click on the ‘Save’ button. SPSS opens a new window.

Ask for the unstandardized predicted values and the unstandardized residuals. Paste and run the syntax.
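The pasted syntax should look roughly like the sketch below; scemoti is the dependent variable used later in this tutorial, while variety is an assumed name for the predictor (check your Variable View):

```
* Sketch: rerun the regression, saving unstandardized predicted
* values (PRE_1) and residuals (RES_1) as new columns.
REGRESSION
  /DEPENDENT scemoti
  /METHOD=ENTER variety
  /SAVE PRED RESID.
```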

Let’s inspect the Data View in SPSS again and verify that SPSS added two columns in the data file. One column is labeled PRE_1 and the other RES_1. These columns show the predicted values and residuals for each person, respectively.

You may verify this for the first person (i.e., the values should be the same as you computed in the previous steps).

Now we will look at the variance of the observed values of Emotional pressure, the variance of the predicted values of Emotional pressure, and the variance of the residuals.

Compute the variances of Emotional pressure, as well as for the predicted values of Emotional pressure, and for the residuals.

Navigate to Analyze > Descriptive Statistics > Descriptives. Select scemoti, PRE_1, and RES_1. Click on ‘Options’ and ask for the variance. Paste and run the syntax.
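The pasted syntax should look roughly like this:

```
* Variances of the observed scores, predicted values, and residuals.
DESCRIPTIVES VARIABLES=scemoti PRE_1 RES_1
  /STATISTICS=MEAN STDDEV VARIANCE.
```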

How large is the variance of the observed scores for Emotional pressure?

How large is the variance of the predicted values of Emotional pressure?

How large is the variance of the residuals?

In the previous questions, we looked at three variance components.

Discuss with your group what the three variances represent.

The variance in observed values of Emotional pressure is the total variance in Y (i.e., the dependent variable). The variance in the predicted values of Emotional pressure reflects “differences in emotional pressure that can be explained because some persons have a job with a lot of variety and some have a monotonous job”. This variance component is also known as the explained variance. The variance of the residuals, also known as the residual variance, represents differences in emotional pressure that cannot be attributed to differences in variety at work. Hence, the residual describes differences that are unrelated to variety at work.

In the previous step, we looked at the variances themselves, but the raw numbers are not very informative. A more convenient way to look at the explained variance is as a proportion.

So, let’s use the variances we just generated to calculate the proportion of variance in Emotional pressure that can be explained by Variety at work.
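Because the variance of the observed scores splits exactly into the variance of the predicted values plus the variance of the residuals, the proportion can be computed directly from the three variances you just obtained:

\[
R^2 = \frac{\text{Var}(\text{predicted})}{\text{Var}(\text{observed})} = 1 - \frac{\text{Var}(\text{residuals})}{\text{Var}(\text{observed})}
\]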

What percentage of the total variance in emotional pressure can be explained by variety at work?

Consult the output of the regression analysis again, particularly the table Model Summary.

Verify that the R-square that is reported in the table is the same as the proportion of explained variance that you have calculated yourself.

Finally, independently go through all the steps of a simple regression analysis using the data file LAS_SS_Work.sav.

Your theory suggests that independence at work predicts emotional pressure.

  • Construct an appropriate research question and hypotheses.
  • Conduct the analysis.
  • Describe the relationship (i.e., the regression coefficient).
  • Discuss the effect size in terms of \(R^2\).
  • Perform a significance test and report your results.

Finally, compare the standardized regression coefficient to the R coefficient in the Model Summary table, and optionally to a correlation computed via the Correlation interface. Verify that these are all identical (up to the sign: R in the Model Summary is always reported as positive, so it equals the absolute value of the correlation).