Chapter 11 Week 2 - Class
In the unlikely event that you were able to replicate the results of Kestilä in the take-home exercise, rerun your analysis script. If you did not manage to replicate the results, load the dataset ESSround1-b.csv
into RStudio.
11.0.1 Loading the data
If you don’t know how to load the data, click below.
Click for explanation
<- read.csv("ESSround1-b.csv", stringsAsFactors = FALSE) df
11.0.2 Question 1
Kestilä states that running a Principal Components Analysis is a good way to test whether the survey questions in the ESS measure attitudes towards immigration and trust in politics. Based on your reading of Preacher and MacCallum (2003), do you agree with this position?
11.0.3 Question 2
If you would have to choose a method for constructing the ‘trust in politics’ and ‘attitude towards immigration’ scales based on the theory and background information in the Kestilä article, what type of factor analysis would you choose? What key factors influence your decision?
Click for information
Key factors include:
- Theory-driven or exploratory?
- Estimation method
- Rotation method
- Method to establish how many factors are needed
11.0.4 Question 3
Run two factor analyses, one for each PCA of the original article. Inspect the number of factors necessary, evaluate the rotation method, and if necessary, run the factor analysis again with adapted settings (rotation method and/or different number of factors). How many factors are there?
Hint: Examing the help file for ?psych::fa
Click for explanation
To perform an exploratory factor analysis, you can use the function fa
of the package psych
. You will have to specify the data, and the variables that you want to include in the factor analyses. Furthermore, you will have to specify the number of factors that you want to extract, the rotation method and the estimation method.
library(psych)
library(GPArotation)
<- fa(df[, 7:19], nfactors = 3, rotate = "promax", fm = "ml", scores = "Bartlett") efa_trust
In order to determine the number of factors to extract, you might want to look at the eigenvalues of the factors, which can be accessed by using the following code:
round(efa_trust$values, digits = 3)
## [1] 5.054 0.874 0.687 0.304 0.185 0.076 0.014 0.008 -0.060 -0.078
## [11] -0.123 -0.149 -0.191
Another common approach is to make a scree plot. All that is required for this, is to pass the eigenvalues to the qplot()
function and add a geom_path:
library(ggplot2)
qplot(y = efa_trust$values) + geom_path()
You can do the same for the attitude variables:
<- fa(df[, 20:44], nfactors = 5, rotate = "promax", fm = "ml", scores = "Bartlett")
efa_att round(efa_att$values, digits = 3)
## [1] 8.016 1.702 1.051 0.697 0.500 0.389 0.218 0.162 0.091 0.046
## [11] 0.043 0.028 0.022 0.008 -0.018 -0.028 -0.032 -0.056 -0.066 -0.083
## [21] -0.089 -0.119 -0.132 -0.146 -0.266
qplot(y = efa_att$values) + geom_path()
As recommended in the lecture, a better and more quantified way to establish the number of factors is “parallel analysis” (Horn, 1965). Here’s how to do it in R:
fa.parallel(df[, 7:19])
## Parallel analysis suggests that the number of factors = 6 and the number of components = 3
Note that, in this case, parallel analysis recommends as many as 6 factors. However, in order to be able to compare our results with those from Kestilä, we might consider aligning ourselves with the literature and choosing the same number of factors for our analyses as components in the article.
11.0.5 Question 4
Apart from the number of factors, you also want to look at the factor loadings. They can be found in the “pattern matrix.” The higher the factor loadings are, the more indicative an item is for the latent factor. If you find some items to have only very low loadings (indicating that the items do not provide much information about the factor), you may choose not to include them in your analysis. This means you have to rerun the analysis under question 3.
Click for explanation
You can find the factor loadings by means of the ‘print’-function used in the take-home exercise, or you can search for the variable ‘loadings,’ which is inside the results object, to end up with just the information you are searching for.
$loadings efa_trust
##
## Loadings:
## ML1 ML2 ML3
## pltcare 0.784 -0.126
## pltinvt 0.783 -0.131
## trstprl 0.528 0.101 0.258
## trstlgl 0.827
## trstplc -0.157 0.800
## trstplt 0.713 0.116
## trstep 0.432 0.303
## trstun 0.343 0.366
## stfeco 0.128 0.725 -0.129
## stfgov 0.270 0.640 -0.132
## stfdem 0.201 0.479 0.129
## stfedu -0.166 0.665 0.104
## stfhlth -0.137 0.632
##
## ML1 ML2 ML3
## SS loadings 2.518 2.035 1.725
## Proportion Var 0.194 0.157 0.133
## Cumulative Var 0.194 0.350 0.483
$loadings efa_att
##
## Loadings:
## ML3 ML1 ML5 ML4 ML2
## imsmetn 0.479 0.403
## imdfetn 0.321 0.563
## eimrcnt 1.110 -0.187
## eimpcnt 0.345 0.665
## imrcntr 0.798
## impcntr 0.259 0.727
## qfimchr 0.126 0.867
## qfimwht 0.114 0.749
## imwgdwn 0.511 0.182
## imhecop 0.560 0.153
## imtcjob 0.712 0.138
## imbleco 0.702 0.162
## imbgeco 0.740
## imueclt 0.470 -0.166 -0.166
## imwbcnt 0.639 -0.169
## imwbcrm 0.515 -0.177 0.101
## imrsprc 0.534 0.159
## pplstrd 0.245 -0.365
## vrtrlg -0.115 0.186 0.270
## shrrfg 0.302 -0.280
## rfgawrk 0.474
## gvrfgap 0.752
## rfgfrpc 0.219 -0.255
## rfggvfn 0.489
## rfgbfml 0.632
##
## ML3 ML1 ML5 ML4 ML2
## SS loadings 3.271 2.397 2.061 1.648 1.549
## Proportion Var 0.131 0.096 0.082 0.066 0.062
## Cumulative Var 0.131 0.227 0.309 0.375 0.437
The matrix of loadings indicates how strongly each factor (columns) is associated with the items (rows).
Below the matrix of loadings, we see a second matrix, which indicates (amongst other things) the
proportion var: How much variance in the items is explained by each of the factors.
Each subsequent factor explains slightly less variance than the ones before it (this is a property of exploratory factor analysis).
The cumulative var indicates how much variance the factors explain, in total. If you estimated as many factors as items, then the last value for cumulative var would be 1.00
(100%).
The factor loading matrix is slightly hard to read, due to the jumble of factor loadings. To create more clarity, it is convenient to suppress the factor loadings that are lower than .30.
print(efa_trust$loadings, cut = .30, digits = 2)
##
## Loadings:
## ML1 ML2 ML3
## pltcare 0.78
## pltinvt 0.78
## trstprl 0.53
## trstlgl 0.83
## trstplc 0.80
## trstplt 0.71
## trstep 0.43 0.30
## trstun 0.34 0.37
## stfeco 0.73
## stfgov 0.64
## stfdem 0.48
## stfedu 0.67
## stfhlth 0.63
##
## ML1 ML2 ML3
## SS loadings 2.52 2.03 1.72
## Proportion Var 0.19 0.16 0.13
## Cumulative Var 0.19 0.35 0.48
Furthermore, if you want to perform a factor analysis without, say, stfedu
, while you want all other variables included in your factor analysis, you can simply leave the column number of stfedu
, which is 13, out of the command:
<- fa(df[, c(7:12, 14:19)], nfactors = 3, rotate = "promax", fm = "ml") efa_trust_without_stfedu
11.0.6 Question 5
Give the factor scores an appropriate name. You can do this by inspecting the items that load on one factor. What do these items have in common substantively? The goal of a factor analysis usually is to create interpretable factors. If you have trouble interpreting the factors, you can choose to tweak the analysis by changing the options, or including/excluding more items.
Furthermore, after you named the factor scores accordingly, extract them from the results object and add them to the data.frame.
Hint: If you do not know how to do this, have a look at question 1.h from the take-home exercise of week 2.
Please note that the colnames
will be specified from left to right, and not, for example, from ML1 to ML5.
11.0.7 Question 6
The next step is to assert whether the items that together form one factor, also form a reliable scale. Run separate reliability analyses by means of the function alpha
for the items that together form one factor, and evaluate Cronbach’s alpha to see whether the scales are internally consistent. The “Reliability if an item is dropped (alpha.drop)” information may be handy to inspect what would happen if you would delete one item; you can find it inside the reliability analysis object. If Cronbach’s alpha is not ok, deselect one survey item and run the analyses under question 4 and question 5 again.
Hint: Cronbach’s alpha > .7 are deemed to be ok, > .8 is good.
If Cronbach’s alpha is not ok, deselect one survey item and run the analyses under question 4 and question 5 again.
Click for explanation
If you want to assess the reliability of the variables pltcare
, pltinvt
, trstprl
, trstplt
, and trstep
you can run a reliability analysis as follows.
Hint: name the new objects substantively, instead of numbering them.
library(psych)
<- psych::alpha(df_with_scores[, c(7, 8, 9, 12, 13)])
reli_1 reli_1
##
## Reliability analysis
## Call: psych::alpha(x = df_with_scores[, c(7, 8, 9, 12, 13)])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.8 0.83 0.82 0.49 4.8 0.0019 4.4 1.4 0.52
##
## lower alpha upper 95% confidence boundaries
## 0.8 0.8 0.81
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## pltcare 0.79 0.80 0.77 0.50 4.0 0.0021 0.017 0.52
## pltinvt 0.80 0.81 0.78 0.51 4.2 0.0021 0.016 0.53
## trstprl 0.73 0.78 0.76 0.47 3.6 0.0027 0.018 0.52
## trstplt 0.70 0.75 0.73 0.43 3.1 0.0031 0.014 0.43
## trstep 0.78 0.82 0.80 0.53 4.6 0.0021 0.010 0.52
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## pltcare 17975 0.66 0.75 0.67 0.56 2.6 1.1
## pltinvt 17971 0.64 0.74 0.65 0.54 2.4 1.1
## trstprl 17753 0.84 0.80 0.74 0.69 6.3 2.3
## trstplt 17966 0.88 0.86 0.83 0.77 5.3 2.2
## trstep 16390 0.77 0.70 0.59 0.57 5.6 2.3
Sometimes, the table “Reliability if an item is dropped” will indicate that Cronbach’s alpha increases when you drop a variable out of the analysis. Note, however, that doing so is an exploratory approach to analysis, and it may make your work incomparable to other publications using the same scale.
11.0.8 Question 7
Now you can analyze the differences between the factor scores for the PCA analysis (take-home exercise 2) and the EFA by plotting them in a series of scatterplots (bivariate) using ggplot2
. The PCA factor scores are already stored in the dataset ESSround1-b.csv
in columns 45 through 52 (df[, 45:52]
).
Plot the scores for one of your new factors on the x axis, and try to match it with a corresponding PCA component in the dataset. Note that if you went with a different number of factors than Kestilä’s components (5 for ‘trust in politics’ and 3 for ‘attitudes towards immigration’), you may find little correlation between any factors.
Click for explanation
Make sure to adjust the arguments x = EFA_trust1
and y = PCA_Trust1
to align with your own EFA factor column name and a PCA component column name from the data respectively
library(ggplot2)
ggplot(data = df_with_scores) +
geom_point(mapping = aes(x = EFA_Trust1, y = PCA_Trust1))
11.0.9 Question 8
Examine the correlations between the PCA and EFA scales for the ‘trust in politics’ scores, and for the ‘immigration’ scores. What is your conclusion: is there a difference between them?
Click for explanation
Hint: Name the new objects substantively once again.
Example for trust in politics:
library(psych)
corr.test(df_with_scores$EFA_Trust1, df_with_scores$PCA_Trust1, use = "complete.obs")
## Call:corr.test(x = df_with_scores$trustinstEFA, y = df_with_scores$trustinst,
## use = "complete.obs")
## Correlation matrix
## [1] 0.94
## Sample Size
## [1] 14778
## These are the unadjusted probability values.
## The probability values adjusted for multiple tests are in the p.adj object.
## [1] 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
11.0.10 Question 9
Kestila uses the PCA factor scores to evaluate country level differences in 1. Attitudes towards immigration and 2. Political trust. Repeat her analyses using the factor scores you saved in step 5. Think about the statistical test you would like to use. Do you draw similar or different conclusions?
Click for explanation
We can use an ANOVA to test whether the countries differ in the amount of political trust the participants have.
<- aov(trustinstEFA ~ cntry, data = df_with_scores)
fit_ANOVA summary(fit_ANOVA)
## Df Sum Sq Mean Sq F value Pr(>F)
## cntry 8 1740 217.55 199.6 <2e-16 ***
## Residuals 14769 16100 1.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 3409 observations deleted due to missingness
In which trustinstEFA
is the dependent variable, cntry
is the grouping variable, and df
is the name of the dataset; summary
will provide the results. However, it will only indicate whether the variable cntry
is significant or not, so you will be unable to tell which countries differ in terms of their political trust. To tell which countries differ, you can do a pairwise comparison test, with a Bonferroni adjustment for multiple testing.
<- pairwise.t.test(df_with_scores$trustinstEFA, df_with_scores$cntry, p.adjust.method = "bonf", na.rm = TRUE)
pair_comparison round(pair_comparison$p.value, digits = 3)
## Austria Belgium Denmark Finland Germany Italy Netherlands Norway
## Belgium 0.000 NA NA NA NA NA NA NA
## Denmark 0.000 0 NA NA NA NA NA NA
## Finland 0.000 0 0.085 NA NA NA NA NA
## Germany 0.033 0 0.000 0 NA NA NA NA
## Italy 1.000 0 0.000 0 1 NA NA NA
## Netherlands 0.000 0 0.000 0 0 0.000 NA NA
## Norway 0.000 0 0.000 0 0 0.000 0 NA
## Sweden 0.231 0 0.000 0 0 0.144 0 0.005
11.0.11 Question 10
The second goal of Kestilä is to show how socio-demographic characteristics affect attitudes towards immigrants and trust in politics in Finland. Select only the Finnish cases using the variable cntry
.
Click for explanation
To select the Finnish cases only, select rows for which $cntry
is equal to (==
) Finland:
<- df_with_scores[df_with_scores$cntry == "Finland", ] df_finland
After selecting the Finnish cases, we have to prepare our variables for analysis.
Thanks to the factor analyses above, our dependent variables are taken care of.
However, we should examine the independent variables to make sure there are no surprises,
and do any recoding necessary before we can run multiple regression. The independent variables are
c("gndr", "yrbrn", "eduyrs", "polintr", "lrscale")
.
Click for explanation
library(tidySEM)
descriptives(df_finland[, c("gndr", "yrbrn", "eduyrs", "polintr", "lrscale")])
Note that we still have to do some recoding. For example, the data contains participants’ year of birth instead of their age. We have to recode this variable. The data were collected in 2002, so we can simply subtract the year of birth of every participant from the year 2002.
$age <- (2002 - df_finland$yrbrn) df_finland
Furthermore, the variables polintr
and lrscale
are currently coded as character variables. If you analyze them like that, R will make dummies for all distinct values on these variables. Let’s recode those, too! The main challenge is to get the levels of these variables in the correct order, and convert them to numbers. We demonstrate two ways to do this, one way for each of the two variables.
For political interest, we convert the character variable to an ordered categorical variable, and we specify the correct order of labels. Then, we convert it to a numeric variable.
table(df_finland$polintr)
##
## Hardly interested Not at all interested Quite interested
## 842 228 785
## Very interested
## 144
$polintr <- ordered(df_finland$polintr, levels = c("Not at all interested", "Hardly interested", "Quite interested", "Very interested"))
df_finlandtable(df_finland$polintr)
##
## Not at all interested Hardly interested Quite interested
## 228 842 785
## Very interested
## 144
$polintr <- as.numeric(df_finland$polintr) df_finland
For political orientation, only the two extreme values of the scale are labeled as text. Thus, we can replace these labels with numbers, and then convert the entire variable to numeric.
table(df_finland$lrscale)
##
## 1 2 3 4 5 6 7 8 9 Left Right
## 25 62 148 183 599 199 296 217 79 24 59
$lrscale[df_finland$lrscale == "Left"] <- 0
df_finland$lrscale[df_finland$lrscale == "Right"] <- 10
df_finland$lrscale <- as.numeric(df_finland$lrscale) df_finland
Furthermore, Kestilä recoded polintr
in such a way that there are only two categories. The lowest two and highest two categories were combined. Here is how to do this in R, so you can replicate their analysis:
$polintr_dummy <- ifelse(df_finland$polintr <= 2,
df_finland"Not or hardly interested",
"Quite or very interested")
Next, run a number of multiple linear regression analyses with the (sub-)scales of attitudes towards immigrants and political trust as subsequent dependent variables, and the same predictors as Kestilä. Inspect your output.
Compare your results with the results from Kestilä. How do your results differ or agree with the results by Kestilä?
<- lm(trustinstEFA ~ gndr + age + eduyrs + polintr_dummy + lrscale, data = df_finland, na.action = na.omit)
fit_trust1 summary(fit_trust1)
##
## Call:
## lm(formula = trustinstEFA ~ gndr + age + eduyrs + polintr_dummy +
## lrscale, data = df_finland, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0364 -0.4740 0.1394 0.6379 2.3394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.100441 0.125066 -0.803 0.4220
## gndrMale 0.010228 0.044386 0.230 0.8178
## age -0.001053 0.001356 -0.776 0.4377
## eduyrs 0.027904 0.006277 4.445 9.34e-06 ***
## polintr_dummyQuite or very interested 0.090948 0.045642 1.993 0.0465 *
## lrscale 0.051650 0.011037 4.680 3.10e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9159 on 1734 degrees of freedom
## (260 observations deleted due to missingness)
## Multiple R-squared: 0.03377, Adjusted R-squared: 0.03099
## F-statistic: 12.12 on 5 and 1734 DF, p-value: 1.454e-11
11.0.12 Question 11
Save your syntax and your data, you will need it next week.