
Measurement invariance: a simple idea, many practical problems

  • Writer: Yulia Kuzmina
  • 3 days ago
  • 7 min read

Recently, I started working intensively on a new research grant. As part of this project, I am doing a secondary analysis of TIMSS data, and one of my tasks is to evaluate measurement invariance for several scales across 4th and 8th grades.

I have been familiar with the concept of measurement invariance for many years. I used to teach on a master’s programme, and in the Structural Equation Modelling course we discussed measurement invariance in considerable detail. Yet every time I return to this topic, whether while preparing classes or while working on a real research project, I realize that it is much more complicated than it appears at first glance.

What often looks like a standard procedure in textbooks turns out to involve a number of important methodological decisions. In practice, when a researcher starts testing measurement invariance, several questions immediately arise:

  1. Should the indicators be treated as continuous or categorical, and which estimation method follows from that choice?

  2. What should be the sequence of invariance testing steps? The answer is not as obvious as it may seem.

  3. How should nested models be compared, and on what basis should one conclude that a more restricted model, for example a model with equal factor loadings across groups, fits significantly worse than a less restricted model?

  4. If partial invariance is being assessed, when should one stop freeing parameters?

  5. What should be done if measurement invariance is not established?

In this post, I want to focus only on the first two of these problems.

What is measurement invariance

Measurement invariance means that the same scale measures the same latent construct in the same way across groups. This issue becomes especially important when researchers want to compare results of some test or psychological scale across groups. For example, they may want to compare boys and girls, immigrants and native students, or different age groups. The same problem also arises when researchers compare countries. This is a very common situation in large-scale international educational studies such as PISA or TIMSS, where researchers often ask questions like: in which countries are students more motivated to learn mathematics or where do they report stronger self-confidence?

In all of these cases, the central issue is comparability. If we observe differences between groups, can we interpret them as differences in the latent construct itself, or might they instead reflect differences in how the scale functions across groups?

A common way to address this question is to test measurement invariance using multi-group CFA. In its standard form, this involves a sequence of increasingly restrictive models: configural invariance, metric invariance, scalar invariance. Configural invariance means that the same general factor structure holds across groups. Metric invariance means that factor loadings are equal across groups, so the items relate to the latent construct in the same way. Scalar invariance adds equality of item intercepts, and is usually treated as the condition required for comparing latent means.
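In the usual notation for multi-group CFA with continuous indicators (where g indexes groups), the three levels correspond to progressively stronger equality constraints on the measurement model:

```latex
y_g = \nu_g + \Lambda_g \eta_g + \varepsilon_g
```

Configural invariance constrains only the pattern of zero and non-zero loadings in Λ_g to be the same across groups; metric invariance adds Λ_g = Λ for all g; scalar invariance additionally requires ν_g = ν.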

Each subsequent model imposes additional equality constraints across groups and is therefore more restrictive than the previous one. This makes it possible to compare models and ask whether the more restricted model fits the data significantly worse than the less restricted one. One of the best-known ways to do this is to use the chi-square difference test, although, as I will discuss later, this is not always the most appropriate criterion.
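As a concrete illustration, here is a minimal sketch in R with lavaan of how this sequence is typically run for continuous indicators. The data frame, grouping variable, and item names (df, grade, m1–m4) are hypothetical placeholders, not actual TIMSS variables:

```r
library(lavaan)

# Hypothetical one-factor model; "df", "grade", and m1-m4 are placeholders.
model <- 'motivation =~ m1 + m2 + m3 + m4'

# Increasingly restrictive multi-group models (continuous indicators, ML).
fit_configural <- cfa(model, data = df, group = "grade")
fit_metric     <- cfa(model, data = df, group = "grade",
                      group.equal = "loadings")
fit_scalar     <- cfa(model, data = df, group = "grade",
                      group.equal = c("loadings", "intercepts"))

# Chi-square difference tests between adjacent nested models.
lavTestLRT(fit_configural, fit_metric, fit_scalar)
```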


If invariance does not hold, then group differences may reflect not only differences in the latent construct itself, but also differences in how the items function across groups.


Although measurement invariance has received substantial methodological attention, in practice it is tested much less often than one might expect. Maassen and colleagues (2025) found that measurement invariance was tested in only 4% of eligible comparisons in psychology articles with open data. When they conducted their own analyses, only 26% reached sufficient scalar invariance, whereas measurement invariance failed completely in 58% of cases. This suggests that many published group comparisons may rely on assumptions about measurement equivalence that were never actually checked (Maassen et al., 2025).

So, the general principle is clear: if we want to compare groups, measurement invariance is something we should take seriously. The difficulty begins when we try to do this in practice.

1. Continuous or categorical indicators?

The first problem concerns the indicators themselves. Many psychological and educational scales use Likert-type items, but it is not always clear whether these items should be treated as approximately continuous or explicitly categorical. This is not a minor technical detail. It affects the parameters that are estimated, the estimation method, the identification of the model, and the interpretation of the results (Wu & Leung, 2017; Lubke & Muthén, 2004).

In many applied studies, researchers simplify this issue by treating Likert-type indicators as interval variables. This makes the analysis easier because it allows them to use the familiar CFA framework for continuous indicators, conventional maximum likelihood estimation, and the standard chi-square logic for comparing nested models.

However, this choice is not always justified. Likert-type items are, strictly speaking, ordinal rather than interval variables. Increasing the number of response categories does bring Likert scales closer to an underlying continuous distribution: a simulation study showed that the interval approximation becomes more defensible as the number of categories grows, especially around 11 points (Wu & Leung, 2017).

In other cases, treating ordinal variables as continuous may create problems in factor analysis and invariance testing (Lubke & Muthén, 2004). For example, simulation studies have shown that when ordinal items with few response categories are analyzed as continuous, chi-square statistics may be inflated and parameter estimates attenuated (Rhemtulla et al., 2012). From a methodological point of view, it is therefore often more appropriate to treat Likert-type indicators as ordinal and to use estimation methods designed for categorical data.

If the indicators are treated as ordered categorical, the model changes in important ways. For continuous indicators, scalar invariance is formulated in terms of equality constraints on item intercepts, that is, one intercept per item. For ordered categorical indicators, however, intercepts are replaced by thresholds, and their number is larger: for each item, the number of thresholds equals the number of response categories minus one.
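The reason is the latent response formulation behind categorical CFA: each ordinal item y with C categories is treated as a coarsened version of a continuous latent response y*, and the thresholds τ are the cut points:

```latex
y = c \quad\Longleftrightarrow\quad \tau_{c-1} < y^{*} \le \tau_{c},
\qquad c = 1, \dots, C, \qquad \tau_{0} = -\infty,\ \tau_{C} = +\infty
```

With the two extreme thresholds fixed, each item has C − 1 free thresholds, so a four-point Likert item contributes three thresholds where a continuous treatment would estimate a single intercept.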

Moreover, for ordered categorical indicators, conventional maximum likelihood is usually not the most appropriate estimation method. In practice, this means using weighted least squares type estimators, such as WLSMV, rather than the standard ML approach.
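In lavaan, for example, this switch amounts to declaring the items as ordered, after which estimation defaults to WLSMV. A hedged sketch, with the same placeholder names as above:

```r
library(lavaan)

model <- 'motivation =~ m1 + m2 + m3 + m4'   # placeholder names as before

# Declaring the items as ordered tells lavaan to treat them as ordinal:
# each item gets (number of categories - 1) thresholds instead of one
# intercept, and estimation defaults to WLSMV rather than ML.
fit_cat <- cfa(model, data = df, group = "grade",
               ordered = c("m1", "m2", "m3", "m4"))
summary(fit_cat, fit.measures = TRUE)
```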

However, this in turn creates a new set of problems and methodological difficulties, some of which are still not fully resolved. One of these concerns the sequence in which nested invariance models should be tested. More specifically, the question is which parameters should first be constrained to be equal across groups: factor loadings or thresholds.

2. What is the correct sequence of testing steps?

In many texts, the process is presented as linear: first configural invariance, then metric, then scalar. This is the conventional logic for continuous indicators.

However, for ordered categorical indicators, Wu and Estabrook (2016) show that this sequence should be changed. Their argument is based on model identification. With ordered categorical indicators, the latent response variables do not have a natural scale, so the baseline model must be identified by imposing additional constraints. Wu and Estabrook therefore argue that threshold invariance should be considered before loading invariance, because loading invariance cannot be meaningfully tested in the absence of threshold invariance.
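The Wu and Estabrook (2016) identification scheme is implemented in the semTools package, as illustrated in Svetina et al. (2019). A hedged sketch, continuing the placeholder example:

```r
library(lavaan)
library(semTools)

model <- 'motivation =~ m1 + m2 + m3 + m4'   # placeholder names again
items <- c("m1", "m2", "m3", "m4")

# Baseline (configural) model identified as in Wu & Estabrook (2016).
syntax_config <- measEq.syntax(configural.model = model, data = df,
                               ordered = items, parameterization = "delta",
                               ID.fac = "std.lv",
                               ID.cat = "Wu.Estabrook.2016",
                               group = "grade")
fit_config <- cfa(as.character(syntax_config), data = df,
                  ordered = items, group = "grade")

# Threshold invariance is imposed before loading invariance.
syntax_thresh <- measEq.syntax(configural.model = model, data = df,
                               ordered = items, parameterization = "delta",
                               ID.fac = "std.lv",
                               ID.cat = "Wu.Estabrook.2016",
                               group = "grade",
                               group.equal = "thresholds")
fit_thresh <- cfa(as.character(syntax_thresh), data = df,
                  ordered = items, group = "grade")

lavTestLRT(fit_config, fit_thresh)
```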

At the same time, the Mplus User’s Guide presents a more familiar sequence for ordered categorical indicators, namely configural, metric, and scalar invariance. But even there the categorical case is not treated in the same way as the continuous one. In the metric model, Mplus holds the first threshold of each item equal across groups, and it also constrains the second threshold of the item that is used to set the metric of the factor. These constraints help define a common scale for the latent response variables across groups and make the comparison of loadings possible. So although the sequence looks conventional, the underlying logic is already different (Muthén & Muthén, 2017).

This point becomes especially clear in the discussion of partial invariance. The Mplus guide explicitly notes that for categorical indicators, thresholds and factor loadings for a given item should be freed together. In other words, for ordered categorical data, thresholds and loadings are not independent (Muthén & Muthén, 2017).
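In lavaan this pairing can be made explicit through the group.partial argument. A minimal hedged sketch, continuing the placeholder example and assuming m3 is the hypothetically non-invariant item:

```r
# Free both the loading and the thresholds of one item across groups.
# "m3" is a hypothetical non-invariant item with four response
# categories, hence thresholds t1-t3; in lavaan, thresholds are
# labelled "item | t1", "item | t2", and so on.
fit_partial <- cfa(model, data = df, group = "grade",
                   ordered = c("m1", "m2", "m3", "m4"),
                   group.equal = c("loadings", "thresholds"),
                   group.partial = c("motivation =~ m3",
                                     "m3 | t1", "m3 | t2", "m3 | t3"))
```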

Thus, two different testing sequences are possible. Although they may ultimately lead to the same substantive conclusion, the sequence still matters because the fully constrained model is evaluated against a different comparison model in each case: in one sequence, it is compared with a model in which factor loadings are constrained equal, whereas in the other it is compared with a model in which thresholds are constrained equal.

Closing remarks

For me, these are the first two practical problems that appear almost immediately when one starts working seriously with measurement invariance: the nature of the indicators, and the sequence of testing steps.

At least in these two cases, there is some relatively clear methodological guidance. There is enough work showing that Likert-type indicators should not automatically be treated as continuous, and there are also arguments explaining the sequence of invariance testing.

However, the further one moves in the analysis, the fewer shared conventions there seem to be. Questions about how nested models should be compared, which fit criteria should be used, when partial invariance is acceptable, and what to do if invariance is not supported are much less settled.

In the next post, I will look more closely at one of these debates: how nested models are compared in measurement invariance testing, what conventions are commonly used, and why these conventions do not always work well.


References

Lubke, G. H., & Muthén, B. O. (2004). Applying multigroup confirmatory factor models for continuous outcomes to Likert scale data complicates meaningful group comparisons. Structural Equation Modeling: A Multidisciplinary Journal, 11(4), 514–534.

Maassen, E., D’Urso, E. D., van Assen, M. A. L. M., Nuijten, M. B., De Roover, K., & Wicherts, J. M. (2025). The dire disregard of measurement invariance testing in psychological science. Psychological Methods, 30(5), 966–979.

Muthén, L. K., & Muthén, B. O. (2017). Mplus User’s Guide (8th ed.). Muthén & Muthén.

Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17(3), 354–373.

Svetina, D., Rutkowski, L., & Rutkowski, D. (2019). Multiple-group invariance with categorical outcomes using updated guidelines: An illustration using Mplus and the lavaan/semTools packages. Structural Equation Modeling: A Multidisciplinary Journal.

Tse, W. W.-Y., Lai, M. H. C., & Zhang, Y. (2024). Does strict invariance matter? Valid group mean comparisons with ordered-categorical items. Behavior Research Methods, 56, 3117–3139.

Wu, H., & Estabrook, R. (2016). Identification of confirmatory factor analysis models of different levels of invariance for ordered categorical outcomes. Psychometrika, 81(4), 1014–1045. https://doi.org/10.1007/s11336-016-9506-0

Wu, H., & Leung, S.-O. (2017). Can Likert scales be treated as interval scales? A simulation study. Journal of Social Service Research, 43(4), 527–532.
