This Vignette describes the SMART-LCA Checklist: Standards for More
Accuracy in Reporting of different Types of Latent Class Analysis,
introduced in Van Lissa, C. J., Garnier-Villarreal, M., & Anadria,
D. (2023). Recommended Practices in Latent Class Analysis using the
Open-Source R-Package tidySEM. Structural Equation Modeling. https://doi.org/10.1080/10705511.2023.2250920. This
version of the checklist corresponds to the tidySEM
R-package’s version , 0.2.7.
The paper discusses the best practices on which the checklist is
based and describes specific points to check when writing, reviewing, or
reading a paper in greater detail. However, this vignette may be updated
to keep pace with tidySEM
package development, whereas the
print publication will remain static.
Note that, although the steps below are numbered for reference
purposes, we acknowledge the process of conducting and reporting
research is not always linear.
- Pre-Analysis
- Define a research question and study design suitable for LCA.
- Determine whether the application of LCA is confirmatory or
exploratory.
- Choose valid and reliable indicators with an appropriate data type
for LCA.
- Justify sample size.
- Optionally: pre-register the analysis plan.
- Optionally: use Preregistration-As-Code.
- Examining Observed Data
- Assess indicator distributions and scales.
- Does the observed distribution match the intended measurement level?
E.g., continuous variables have many unique values, categorical
variables have a well-balaced response distribution.
- Are all indicators on similar scales?
- Are there distributional idiosyncrasies that need to be accounted
for in model specification or by transforming the data (e.g., extreme
skew)?
- Data Preprocessing
- Make sure continuous variables have type
numeric
or
integer
and an interval measurement level.
- Convert ordered and binary variables to
mxFactor
.
- Convert nominal variables to binary dummy variables.
- Rescale indicators that are on very different scales to prevent
convergence issues.
- Optionally, transform data to account for distributional
idiosyncracies that could cause the model to fit poorly.
- Missing data
- Examine and report proportion of missingness per variable.
- Report MAR test using
mice::mcar()
.
- Select and report appropriate method to handle missingness. In
tidySEM
, this is full information maximum likelihood
(FIML), which is valid regardless of the outcome of the MAR test.
- Model specification
- For exploratory LCA, list all sensible alternative
specifications
- In confirmatory LCA, justify alternative model specifications on
theoretical grounds
- Clearly report the model specification, so it is clear what all
parameters are.
- Optionally, verify that the model can capture distributional
idiosyncracies (see Data Preprocessing).
- Number of classes
- Justify the maximum number of classes \(k\).
- In exploratory LCA, estimate 1:\(k\) classes.
- In confirmatory LCA, estimate the theoretical number of classes and
choose other models to benchmark it against. This should include a
1-class model, optionally competing theoretical models, and optionally
models with a different numbers of classes.
- Justify the criteria used for class enumeration, which can include:
- Information criteria.
- LR tests.
- The
BLRT()
is best supported by the literature;
lr_lmr()
has been criticized but is also available.
- Theoretically expected class solution.
- Justify the criteria used to eliminate models from consideration,
including:
- Model convergence.
- Classification diagnostics.
- Local identifiability (acceptable number of observations per
parameter in each class).
- Theoretical interpretability of the results.
- Columns
np_global
and np_local
in output
of table_fit()
- Transparency and Reproducibility
- Use free open-source software to enable reproducibility.
- Comprehensively cite literature, software, and data sources.
- Share analysis code.
- If possible, share data. Otherwise, share a synthetic dataset.
- Synthetic data can be created for an entire dataset in a
model-agnostic way via
worcs::synthetic()
- Synthetic data can be generated from the LCA model by running
mxGenerateData()
on the model object.
- Share all digital research output in line with the FAIR
principles.
- Optionally, use the Workflow for Open Reproducible Code in Science
and the
worcs
R-package automate creating a reproducible
research archive.
- Estimation and Convergence
- Report the method of estimation. In
tidySEM
, this is
simulated annealing with informative start values, determined via
K-means clustering for complete data and hierarchical clustering when
there are missing values.
- Assess convergence of all models before interpreting them.
- Optionally, if there is non-convergence, use functions like
mxTryHard()
to aid convergence.
- E.g., if you have concerns about model 4 in output object
res
, you could run
res[[4]] <- mxTryHard(res[[4]], extraTries = 200)
to run
200 permutations.
- Reporting Results
- Report the fit of all models under consideration in a table.
- Report classification diagnostics for all models under
consideration.
- Report the minimal class proportions for all models under
consideration.
- Report class proportions for the final model
class_prob(res, "sum.posterior")
- Report point estimates, standard errors and confidence intervals for
model parameters.
- Optionally, convert thresholds to conditional item probabilities.
- Assign informative class names while clarifying that these names are
just shorthand.
- Inference
- Standard p-values in
table_results()
test the
hypothesis that parameters are equal to zero. Consider whether these
tests are meaningful and relevant.
- Control for, or reflect on, the study-wide Type I error level.
- Optionally, use the standard errors to test other null
hypotheses.
- Optionally, use
wald_test()
to test informative
hypotheses.
- Optionally, to test a confirmatory LCA model, compare parameter
estimates to hypothesized values.
- Optionally, use
lr_test()
to compare parameters across
classes.
- Visualization
- Visualize raw data to assess class separability and model fit.
- Show point estimates of model parameters.
- If possible, show confidence bounds.
- Optionally, show multivariate distributions for multivariate
models.
- Follow-up analyses.
- Account for classification inaccuracy in follow-up analyses.