Cross-Validation in Psychological Research (Part 3): How It Is Used in Practice
- Yulia Kuzmina

In the previous posts, I discussed the historical roots of cross-validation and the methodological decisions involved in implementing it. In this final post, I turn to how cross-validation is actually used in contemporary psychological research.
I will consider two examples that reflect different levels at which cross-validation operates:
1) as a technical resampling procedure used to analyse a single dataset,
2) as a broader research strategy aimed at testing generalizability across independent samples.
Cross-validation for model selection and prediction: Noetel et al. (2023)
A clear example of resampling-based cross-validation is provided by Noetel et al. (2023), who explicitly contrast explanatory and predictive approaches in educational psychology.
The authors examined how teacher behaviours predict changes in student engagement in physical education. Rather than committing to a single theoretical framework, they adopted a cross-theoretical approach. Predictors were derived from multiple motivational traditions, including self-determination theory, achievement goal theory, growth mindset theory, and transformational leadership theory.
This cross-theoretical approach resulted in a large and theoretically heterogeneous predictor set. In a traditional explanatory framework, researchers might test one theory at a time or interpret regression coefficients within a single model. Noetel et al. instead focused on finding the model and combination of teacher behaviours that most accurately predicted changes in student engagement over time.
To do so, the authors used cross-validation in the way it is commonly used in machine learning: they divided the data into a training set and a test set, and within the training set they applied K-fold cross-validation to select and tune the model.
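The split-and-tune procedure described above can be sketched in a few lines. The code below is a minimal, generic illustration in pure Python; the function names, the 80/20 split, and K = 5 are my assumptions for the sketch, not details reported in the paper:

```python
import random

def train_test_split(indices, test_fraction=0.2, seed=42):
    """Hold out a random test set; the remainder is the training set."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def k_fold_splits(train_indices, k=5):
    """Yield (fit, validate) index pairs for K-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i
               for idx in fold]
        yield fit, validate

indices = list(range(100))               # e.g., 100 students
train, test = train_test_split(indices)

# Model selection and tuning happen only inside the training set:
for fit_idx, val_idx in k_fold_splits(train, k=5):
    pass  # fit candidate models on fit_idx, score them on val_idx

# The held-out test set is touched once, at the very end,
# to estimate out-of-sample performance of the chosen model.
```

The key discipline is that the test set plays no role in choosing or tuning the model; it is reserved for a single, final evaluation.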
What did they find?
Using elastic-net regression combined with cross-validation, they found that:
· Out-of-sample predictive performance was moderate but meaningful.
· Some predictors that appeared important in standard regression models did not improve out-of-sample prediction.
· The strongest predictors were not confined to any single theoretical framework.
· Predictive performance was often driven by combinations of variables drawn from different theoretical traditions.
In other words, when the authors evaluated models based on how well they predicted new data, the results did not clearly support one theory over the others. The most useful predictors came from different theories. This means that a theory can look consistent and convincing within one dataset, but that does not guarantee that it will make the best predictions.
Methodologically, cross-validation was central to these conclusions.
Explanation vs Prediction: what changes?
In many ways, this predictive workflow is standard in machine learning. Models are evaluated on data that were not used to fit them, and statistical significance is not the main criterion for comparing models.
What is distinctive in psychology is not the algorithm, but the shift in evaluation criteria.
In a traditional explanatory framework:
· Models are evaluated within the same dataset in which they are estimated.
· Researchers focus on regression coefficients, p-values, and effect sizes.
· The central question is whether specific results support a given theory.
In a predictive framework:
· Models are judged by their ability to reduce prediction error on unseen data.
· Statistical significance is not the primary criterion.
· A theoretically central variable may be dropped if it does not improve out-of-sample performance.
· A theoretically peripheral variable may be retained if it consistently reduces prediction error.
This distinction is not about replacing explanation with prediction. It is about changing what counts as evidence. In this sense, the novelty in psychological research is epistemic rather than technical.
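As a toy illustration of this shift in evaluation criteria, the sketch below fits a simple one-predictor regression on a training set and then judges it by its squared error on held-out observations rather than by in-sample fit. This is a generic pure-Python example, not the authors' analysis (they used elastic-net regression with many predictors):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def mse(model, xs, ys):
    """Mean squared prediction error of y = a + b*x on (xs, ys)."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Fit on training data only...
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.2, 6.8]
model = fit_line(train_x, train_y)

# ...but judge the model on data it has never seen.
test_x, test_y = [4.0, 5.0], [9.1, 10.9]
out_of_sample_error = mse(model, test_x, test_y)
```

In an explanatory workflow the question would be whether the slope is statistically significant; here the criterion is `out_of_sample_error`, and a predictor earns its place only if it lowers that number.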
“Cross-validation” as external validation: Law et al. (2022)
A different usage of the term “cross-validation” appears in Law et al. (2022), who describe their work as a “cross-validation study.” In this case, cross-validation did not involve resampling within a single dataset. Instead, it referred to validation across independent samples.
The authors identified latent classes of young children with ADHD in one sample (Study 1), using latent class analysis. They then tested whether:
· The same class structure could be recovered in a second independent sample (Study 2).
· Classification rules derived from Study 1 could accurately classify individuals in Study 2.
What did they find?
Their results showed that:
· The same latent class solution could be identified in both samples, supporting the stability of the class structure.
· Classification based on the model from Study 1 showed acceptable sensitivity and specificity when applied to Study 2.
· Some differences in class proportions and parameter estimates were observed, indicating sample-specific variation.
In this context, cross-validation meant testing generalizability through replication. This corresponds to what is often called external validation or independent-sample validation in the methodological literature.
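The accuracy check at the heart of this kind of external validation can be sketched as follows. The data and the classification rule below are entirely hypothetical, invented for illustration; Law et al. used latent class analysis, and this sketch only shows how sensitivity and specificity are computed once each individual in the independent sample has a "true" class and a class assigned by the Study 1 rule:

```python
def sensitivity_specificity(true_labels, predicted_labels, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical Study 2 data: class membership from Study 2's own model
# (treated as the reference) vs. classifications from the Study 1 rule.
study2_classes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
study1_rule    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(study2_classes, study1_rule)
# sensitivity 0.75, specificity about 0.83
```

High values on both metrics in the independent sample are what licenses the claim that the Study 1 classification generalizes.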
Two meanings, one principle
These two examples illustrate different levels at which cross-validation operates in psychological research. The underlying principle is the same, namely separating model estimation from model evaluation, but the role of cross-validation differs.
In Noetel et al. (2023), cross-validation is primarily a technical step within the analysis of a single dataset. It is used to guide model selection, tune parameters, and reduce overfitting. Here, cross-validation functions as part of the modelling workflow.
In Law et al. (2022), cross-validation represents a broader research strategy. A model identified in one sample is tested in an independent sample to examine whether the same structure can be reproduced. In this case, cross-validation is not just a technical procedure, but a design-level approach to assessing generalizability.
The second approach is more familiar within psychology. Testing findings in an independent sample has long been considered a desirable standard, even if practical constraints such as time, funding, and access to participants often make it difficult to implement. It remains something of an ideal.
The first approach may be more accessible in situations where collecting new data is not feasible. However, it is still not widely used in psychology. One reason is that many researchers are not yet fully comfortable with machine learning tools and workflows, even though these methods can be adapted to theory-driven research questions.
A personal reflection
I have an academic psychology background, and at first I honestly felt some resistance toward the machine learning approach. In many ML applications, models are built in a very data-driven way. You include many predictors and let the algorithm decide what matters, sometimes without much discussion of theory.
In academic psychology, we are trained differently. Theory usually comes first. We define constructs carefully, justify each variable, and try not to overload the model. I have always been skeptical of studies that collect a lot of data and then simply “see what comes out.” The idea of “garbage in, garbage out” is something most social scientists know very well.
I often think about Jacob Cohen’s essay Things I Have Learned (So Far) (1990). I feel very close to his position. He argued that, except for sample size, “less is more.” He warned against studies with too many independent and dependent variables and stressed the value of working with a small number of well-justified variables. At one point he joked that after hearing him repeat this so often, some of his students said that his ideal study had “10,000 cases and no variables.” I like that joke. It captures something important about parsimony and measurement.
From this perspective, machine learning can feel uncomfortable at first. But over time I have started to think that psychology can actually learn something from these methods, depending on the goal of the study. In the first study discussed earlier, different theoretically grounded predictors were compared in terms of how well they predicted changes in engagement. This does not replace theory, but it adds another way to evaluate it.
Of course, this brings up another issue: how confident are we in the measurement of the outcome we are trying to predict? In psychology, measurement is never a trivial matter. But that is a larger conversation.
Reference:
Noetel, M., Parker, P., Dicke, T., Beauchamp, M. R., Ntoumanis, N., Hulteen, R. M., ... & Lonsdale, C. (2023). Prediction versus explanation in educational psychology: A cross-theoretical approach to using teacher behaviour to predict student engagement in physical education. Educational Psychology Review, 35(3), 73.
Law, E., Sideridis, G., Alkhadim, G., Snyder, J., & Sheridan, M. (2022). Classifying young children with Attention-Deficit/Hyperactivity Disorder based on child, parent, and family characteristics: A cross-validation study. International Journal of Environmental Research and Public Health, 19(15), 9195.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304.