Utrecht University, dept. Methodology & Statistics

Opening

In my Stats 1 class, many exercises describe situations like:

Grades are normally distributed, with \(\mu = 6, \sigma = 2\)

A student asked: “This is a stupid question, but are grades normally distributed?”

Distribution of grades

Actual grades of the 347 students:

[Figure: “camel plot” showing the bimodal distribution of grades]

Types of latent variable analyses

                    Observed variables
Latent variable     Continuous         Categorical
Continuous          Factor analysis    IRT
Categorical         Mixture model      Latent class analysis

Basic idea behind mixture modeling

  • Assume that the population consists of K subpopulations
  • Model the data as a function of (unknown) class membership
  • The simplest “model” is just the class mean
    • Optionally estimate variances / covariances
    • Allow a different model for each class (see the density formula below)
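
Formally, this is the standard finite mixture density (notation added here for clarity): \(f(x) = \sum_{k=1}^{K} \pi_k \, f_k(x \mid \theta_k)\), where \(\pi_k\) are the class proportions and \(f_k\) the class-specific models, e.g., normal densities with class-specific \(\mu_k\) and \(\sigma_k^2\).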

Applications

Analysis goals:

  • Test theory about categorical latent variable (e.g., identity status)
  • Classify individuals
  • Identify number of classes
  • Identify most discriminant items
  • Predict class membership from covariates
  • Predict outcomes from class membership

Challenges: “correct” number of classes?

Approach:

  • Run the same model for a range of possible numbers of classes, e.g., 1:6
  • Examine fit statistics

Pitfalls:

  • Not estimating a one-class model
  • Using entropy as a fit index
    • Entropy quantifies class separation, not model fit
  • Cherry-picking fit statistics
    • Report all of them; a lack of consensus can itself indicate a problem
    • Or make a weighted decision using the AHP

Introducing tidyLPA

  • User-friendly software that implements best practices
  • Freely estimate (or constrain) means, variances, and covariances
  • Estimate and compare solutions for the number of classes
  • Beautiful visualizations

Sample workflow, tidyverse-style

library(tidyLPA)
# Estimate latent profile models with 1 and 2 classes
grades %>%
  estimate_profiles(1:2)
## tidyLPA analysis using mclust: 
## 
##   Model Classes     AIC     BIC Entropy prob_min prob_max n_min n_max BLRT_p
## 1     1       1 1585.41 1593.11    1.00     1.00     1.00  1.00  1.00       
## 2     1       2 1545.82 1561.22    0.74     0.88     0.96  0.49  0.51   0.01

Sample workflow, tidyverse-style

# Compare the fit of the 1- and 2-class solutions
grades %>%
  estimate_profiles(1:2) %>%
  compare_solutions()
## Warning: The solution with the minimum number of classes under consideration
## was considered to be the best solution according to one or more fit indices.
## Examine your results with care; consider adding a smaller number of classes.
## Warning: The solution with the maximum number of classes under consideration
## was considered to be the best solution according to one or more fit indices.
## Examine your results with care and consider estimating more classes.
## Compare tidyLPA solutions:
## 
##  Model Classes BIC     
##  1     1       1593.111
##  1     2       1561.222
## 
## Best model according to BIC is Model 1 with 2 classes.
## 
## An analytic hierarchy process, based on the fit indices AIC, AWE, BIC, CLC, and KIC (Akogul & Erisoglu, 2017), suggests the best solution is Model 1 with 2 classes.

Sample workflow, a bit more complex

grades %>%
  estimate_profiles(1:2, 
                    variances = c("equal", "varying"),
                    covariances = c("zero", "zero"))
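
This estimates both model specifications (equal vs. varying variances, both with covariances fixed to zero) for one and two classes, yielding four solutions to compare.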

For advanced users…

estimate_profiles() calls Mclust() or mplusModeler(), depending on the chosen backend.

Additional arguments are passed to these functions.
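
For example, a minimal sketch selecting the backend explicitly (the package argument defaults to "mclust"; "MplusAutomation" requires an Mplus installation):

# Estimate a two-class model with the mclust backend
grades %>%
  estimate_profiles(2, package = "mclust")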

To compare solutions

# Restrict the comparison to selected fit statistics
grades %>%
  estimate_profiles(1:2) %>%
  compare_solutions(statistics = c("AIC", "BIC", "KIC"))

AHP

Akogul & Erisoglu, 2017

Applied Saaty’s (1978) AHP to a range of model fit indices

  • AIC, AWE, BIC, CLC, KIC
  1. Decision matrix: n (classes) x m (fit indices)
  2. Pairwise comparison matrix for each of the m fit indices: degree of preference for i classes over j classes
  3. Determine the relative importance vector (RIV) for each fit index (based on eigenvector decomposition)
  4. Determine the composite relative importance vector (C-RIV) across indices, weighted by the relative importance of each index

The relative importance of each index is based on its performance in recovering the true number of classes in simulations (see the sketch below).
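
To make the composite-weighting idea concrete, a minimal R sketch (fit values and weights are illustrative, not those of Akogul & Erisoglu, 2017; true AHP derives priorities from pairwise-comparison matrices via eigenvector decomposition, approximated here by simple inversion):

# Hypothetical decision matrix: fit indices for 1-4 classes
fit <- data.frame(classes = 1:4,
                  AIC = c(1585, 1546, 1540, 1542),
                  BIC = c(1593, 1561, 1563, 1572))
# Assumed relative importance of the indices (sums to 1)
weights <- c(AIC = 0.5, BIC = 0.5)

# Lower values are better, so invert and normalize each index
# (a stand-in for the eigenvector-based RIVs of true AHP)
priority <- sapply(names(weights), function(i) {
  v <- 1 / fit[[i]]
  v / sum(v)
})
# Composite relative importance vector (C-RIV)
composite <- priority %*% weights
fit$classes[which.max(composite)]  # preferred number of classes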

Good practices for visualization

Show model parameters

  • Estimated mean
  • Estimated variance (if parameter)

Show parameter uncertainty

  • Confidence interval/band

Show classification uncertainty

  • Show how well classes capture raw data

If the number of classes is exploratory…

  • Compare several class solutions

Some examples

[Figures: example mixture-model visualizations from the literature]

What stands out?

These visualizations…

  • Show only the estimated mean (sometimes even the unweighted mean of the subgroup)
  • \(\sigma\) is not visualized, even when it is a model parameter
  • No standard errors
  • No classification uncertainty; class membership is treated as known
  • Hide the raw data, making it impossible to see whether the classes are distinct

Show model parameters

  • Estimated class mean (or trajectory)
  • Estimated variance (if parameter)

Show parameter uncertainty

  • Confidence interval/band

Show classification uncertainty

  • Show how well classes capture raw data
  • Show classification / transition probabilities

If the number of classes is exploratory…

  • Compare several class solutions

Better profile plot

grades %>%
  estimate_profiles(1:2) %>%
  plot_profiles()

Mixture densities plot

grades %>%
  estimate_profiles(1:2) %>%
  plot_density

Better profile plot

id_edu[, c("com3", "exp3")] %>%
  estimate_profiles(1:4) %>%
  plot_profiles()

Mixture densities plot

id_edu[, c("com3", "exp3")] %>%
  estimate_profiles(1:4) %>%
  plot_density()

Longitudinal extensions

Growth mixture models

Describe heterogeneity in developmental trajectories

  • Capture individual trajectories with a latent growth model
  • Each individual follows a trajectory over time, described by a latent intercept, slope, etc. (see the equation below)
  • Apply LCA to the latent growth variables
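
For example, a quadratic trajectory in standard latent growth notation (added here for clarity): \(y_{it} = \eta_{0i} + \eta_{1i} t + \eta_{2i} t^2 + \varepsilon_{it}\), where the growth factors \(\eta\) get class-specific means in the mixture model.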

Analysis goals:

  • Identify classes that are characterized by different developmental trajectories

Example

# Load MplusAutomation
library(MplusAutomation)

# Latent class growth analysis: quadratic growth model over six occasions;
# i@0; s@0; q@0 fixes the growth factor variances to zero within classes
createMixtures(classes = 1:4, 
               filename_stem = "growth", 
               model_overall = "i s q | ec1@0 ec2@1 ec3@2 ec4@3 ec5@4 ec6@5;
                                i@0;  s@0;  q@0",
               rdata = empathy[, 1:6],
               ANALYSIS = "PROCESSORS = 2;")

# Run the generated Mplus input files
runModels(filefilter = "growth")

# Read the results back into R
results_growth <- readModels(filefilter = "growth")

Summary table

mixtureSummaryTable(results_growth)

Published example: not good

Visualization

# rawdata = TRUE overlays the observed trajectories on the estimated ones
plotGrowthMixtures(results_growth, rawdata = TRUE)

Latent transition analysis

Describe state changes

  • Individuals are in a hidden state at each time point (class membership)
    • Identity Status Theory: Achievement, Moratorium, Diffusion, Foreclosure
  • Capture these states with a mixture model
  • Model the transition matrix over time (see the example below)
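
For example, with two states, a hypothetical transition matrix, where entry (i, j) is the probability of being in state j at time 2 given state i at time 1: \(\begin{pmatrix} .8 & .2 \\ .3 & .7 \end{pmatrix}\)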

Example: Latent transition analysis, identity

results_id <- estimate_profiles(id_edu[, c("com3", "exp3")], 
                                n_profiles = 1:4)
results_id

Example: Latent transition analysis, identity

results_id2 <- estimate_profiles(id_edu[, c("com5", "exp5")], 
                                 n_profiles = 1:4)
results_id2

Do latent transition analysis

createMixtures(
  classes = 4,
  filename_stem = "lta",
  # Regress the time-2 latent class on the time-1 latent class
  # to estimate the transition matrix
  model_overall = "c2 ON c1;",
  # Class-specific means per occasion; {C} is replaced by the class number
  model_class_specific = c(
    "[com3] (com{C});  [exp3] (exp{C});",
    "[com5] (com{C});  [exp5] (exp{C});"
  ),
  rdata = id_edu[, c("com3", "exp3", "com5", "exp5")]
)

Do latent transition analysis

runModels(filefilter = "lta")
lta_id <- readModels("lta_4_class.out")

Some examples

[Figures: example latent transition visualizations]

Plot the latent transition matrix

plotLTA(lta_id)

Take home message

Good practices:

  • Compare several class solutions
  • Show estimated points/trajectory
  • Show confidence interval/band
  • Show how well classes capture raw data
  • Show classification / transition probabilities

Estimation

Mixture models are estimated using the EM (expectation-maximization) algorithm

Two things must be estimated:

  1. \(\theta\): Model parameters for each class
  2. \(P(Z_i = k|X_i, \theta)\): The probability that individual i belongs to each of the classes k, given their observed data X and the model parameters

The problem is that these two are interdependent:

  • Calculations for \(\hat{\theta}\) (means, variances, covariances in each class) are weighted by \(P(Z_i = k|X_i,\theta)\)
  • The probability of class membership depends on that class’s parameters

Estimation 2

So, we go back and forth:

  1. E-step:
    Compute \(P(Z_i = k|X_i,\theta)\), starting from initial values \(\theta^0\)
  2. M-step:
    Estimate \(\hat{\theta}\), using the estimate of \(P(Z_i = k|X_i,\theta)\) from the E-step
  3. Plug the estimate of \(\hat{\theta}\) into the E-step; repeat until convergence (see the sketch below).
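
A minimal EM sketch in R for a univariate two-class Gaussian mixture (illustrative only; tidyLPA and mclust handle estimation internally, and real implementations add numerical safeguards and convergence checks):

set.seed(1)
x <- c(rnorm(100, 4, 1), rnorm(100, 8, 1))  # simulated bimodal "grades"

K <- 2
pi_k <- c(.5, .5); mu <- c(3, 9); sigma <- c(1, 1)  # starting values, theta^0

for (iter in 1:100) {
  # E-step: posterior probability of class membership, P(Z_i = k | X_i, theta)
  dens <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu[k], sigma[k]))
  post <- dens / rowSums(dens)

  # M-step: re-estimate theta, weighting each observation by its posterior
  pi_k  <- colMeans(post)
  mu    <- colSums(post * x) / colSums(post)
  sigma <- sqrt(colSums(post * sapply(1:K, function(k) (x - mu[k])^2)) /
                  colSums(post))
}
round(cbind(pi_k, mu, sigma), 2)  # recovered proportions and parameters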

Back to presentation