
Cross-Validation in Psychology: Historical Roots and Current Use (Part 1)

Writer: Yulia Kuzmina

Two meanings of “cross-validation”

The term cross-validation is used in psychology in at least two related but not identical senses. Historically, these meanings were closely connected, but in contemporary research they often refer to different methodological practices.

In much of psychological and clinical research, cross-validation traditionally refers to validation on an independent sample. In this sense, a model or classification scheme is estimated in one dataset and then evaluated in another dataset drawn from the same or a closely related population. This usage is closely tied to concerns about generalizability and replication and is often what applied researchers have in mind when they speak about “cross-validating” a model.

In an ideal world, psychological models would be validated on an independent sample drawn from the same population. This recommendation is not new and is clearly stated in applied-methods textbooks. For example, Multivariate Data Analysis (Hair, 2009) describes empirical validation as the final step of modeling:

“After identifying the best regression model, the final step is to ensure that it represents the general population (generalizability) and is appropriate for the situation in which it will be used (transferability)… The most appropriate empirical validation approach is to test the regression model on a new sample drawn from the general population. A new sample will ensure representativeness and can be used in several ways. First, the original model can predict values in the new sample and predictive fit can be calculated…” (p. 203)

In practice, however, collecting a truly new validation sample is often difficult due to time, cost, or access to participants. This gap between what is recommended and what is feasible is precisely where cross-validation becomes useful.


In modern machine learning and statistical learning theory, cross-validation usually refers to resampling-based procedures applied within a single dataset, such as K-fold or leave-one-out cross-validation. Here, the main goal is to estimate out-of-sample predictive performance when independent data are not available, and to support model selection or tuning. In this context, cross-validation often functions as a technical tool embedded in automated modeling workflows.
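To make this concrete, the sketch below runs 5-fold and leave-one-out cross-validation with scikit-learn on simulated data. The dataset, the linear model, and all settings are illustrative assumptions rather than a recommended analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Simulated stand-in for a psychological dataset:
# 200 participants, 5 predictors, one continuous outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, 0.3, 0.0, 0.0, 0.2]) + rng.normal(size=200)

model = LinearRegression()

# 5-fold CV: the data are split into five folds, and each fold
# serves once as the held-out evaluation set.
kfold_r2 = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2"
)

# Leave-one-out CV: each single observation is held out in turn.
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error")

print(f"Mean 5-fold R^2: {kfold_r2.mean():.3f}")
print(f"Mean LOO squared error: {loo_mse.mean():.3f}")
```

In workflows like this, the resampling loop is handled entirely by the software, which is part of why cross-validation is often experienced as a routine technical step rather than a conceptual exercise in generalization.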

Historically, these two meanings are not independent. Early methodological work on cross-validation emerged from the problem of approximating independent-sample validation when collecting new data was impractical. Over time, resampling-based cross-validation became increasingly formalized and computationally efficient, especially within machine learning, where it is now often treated as a routine technical step rather than as a conceptual tool for thinking about generalization.


A short history: cross-validation is not a machine-learning invention

The idea of assessing a model on “new” data is much older than modern machine learning practice. In the 1950s, the Psychometric Society hosted a symposium titled The Need and Means of Cross-Validation, and by the 1970s the logic had been formalized in methodological statistics.

A classic reference is Mervyn Stone’s 1974 paper Cross-validatory Choice and Assessment of Statistical Predictions, which reviews earlier uses of cross-validatory logic going back to the 1930s.

The central question was already clear: if we cannot collect new data, how can we simulate validation? In Validation of Regression Models: Methods and Examples, Ronald Snee (1977) formulates this motivation very directly:

“…the collection of new data is the preferred method of model validation. In many instances this is not practical nor possible. In this situation a procedure which simulates the collection of new data is needed. A reasonable way to proceed is to split the data in hand into two sets. The first set of data, called the estimation data, is used to estimate the model coefficients. The remaining data points, called the prediction data, are used to measure the prediction accuracy of the model. Some authors refer to data splitting as cross-validation. A half-and-half split appears to be the most popular method.”

Early cross-validation therefore often meant data splitting, that is, dividing the dataset into estimation and prediction sets (or, in modern terms, training and test sets). Researchers quickly recognized the main limitation of this approach: with small samples, both subsets end up small, which can undermine the stability of model estimation and the precision of the evaluation.
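As a minimal sketch of the half-and-half splitting Snee describes, the following code randomly divides simulated data into an estimation half and a prediction half, fits an ordinary least-squares model on the first, and measures prediction accuracy on the second. All data and choices here are illustrative placeholders.

```python
import numpy as np

# Simulated placeholder data: 60 cases, 3 predictors, one outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([0.4, 0.2, 0.1]) + rng.normal(size=60)

# Randomly assign half the cases to the estimation set, half to the prediction set.
perm = rng.permutation(len(y))
est_idx, pred_idx = perm[: len(y) // 2], perm[len(y) // 2:]

# Fit an ordinary least-squares model (with intercept) on the estimation half.
X_est = np.column_stack([np.ones(len(est_idx)), X[est_idx]])
beta, *_ = np.linalg.lstsq(X_est, y[est_idx], rcond=None)

# Measure prediction accuracy on the held-out prediction half.
X_pred = np.column_stack([np.ones(len(pred_idx)), X[pred_idx]])
mse = np.mean((y[pred_idx] - X_pred @ beta) ** 2)
print(f"Out-of-sample mean squared error: {mse:.3f}")
```

A purely random split like this is the simplest option; the next section turns to early attempts to make the division systematic rather than random.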

Early concerns: how should the data be split?

From the beginning, the main concern surrounding data splitting was not whether the sample should be divided, but how this division should be performed. In particular, researchers were interested in how controllable and systematic the splitting procedure could be.

One early attempt to formalize data splitting was proposed by Snee (1977), who introduced an algorithm known as DUPLEX. The goal of this algorithm was to create two subsamples, an estimation set and a prediction set, that were as similar as possible in their statistical properties while still being distinct.

The DUPLEX algorithm proceeds as follows:

  1. A list of all candidate points (observations) is created.

  2. The points are normalized and orthonormalized, and Euclidean distances are computed for all possible pairs of points.

  3. The two points that are farthest apart are assigned to the estimation set.

  4. The next two most distant points among the remaining observations are assigned to the prediction set.

  5. From the remaining points, the observation farthest from those already in the estimation set is added to that set.

  6. Analogously, the observation farthest from those already in the prediction set is added to the prediction set.

  7. This alternating process continues until all observations have been assigned.

Once a point is assigned to a set, it is removed from further consideration. After both sets are formed, their statistical properties should be examined to ensure that they are approximately comparable.
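The following Python sketch implements this alternating logic. The whitening step and the "farthest from the set" rule, taken here as the maximum of the minimum distances to the points already assigned, are my own implementation choices and not a verbatim reproduction of Snee's published algorithm.

```python
import numpy as np

def duplex_split(X):
    """Sketch of a DUPLEX-style split into estimation and prediction sets.

    Follows the alternating max-distance logic described by Snee (1977);
    the whitening step and the max-min distance rule are implementation
    choices, not an exact reproduction of the original algorithm.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Step 2: center, then decorrelate (whiten) the predictors so that
    # Euclidean distance treats all directions comparably.
    Z = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    Z = Z @ W

    # Pairwise Euclidean distances between all points.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

    remaining = set(range(n))
    estimation, prediction = [], []

    # Steps 3-4: the two most distant points seed the estimation set,
    # the next most distant remaining pair seeds the prediction set.
    for subset in (estimation, prediction):
        idx = list(remaining)
        sub = D[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        subset.extend([idx[i], idx[j]])
        remaining -= {idx[i], idx[j]}

    # Steps 5-7: alternately add the remaining point that is farthest
    # (in the max-min sense) from the points already in each set.
    turn = 0
    while remaining:
        subset = estimation if turn % 2 == 0 else prediction
        idx = list(remaining)
        min_d = D[np.ix_(idx, subset)].min(axis=1)  # distance to nearest assigned point
        pick = idx[int(np.argmax(min_d))]
        subset.append(pick)
        remaining.remove(pick)
        turn += 1

    return np.array(estimation), np.array(prediction)

# Example usage with simulated predictors:
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 4))
est_idx, pred_idx = duplex_split(X_demo)
```

The max-min rule tends to spread both subsets across the predictor space, which is one way of making their statistical properties comparable, as the final check described above intends.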

Although split-sample validation became popular, it was not always appropriate for psychological research, where sample sizes are often small. Dividing the data into two parts substantially reduces the effective sample size in both sets, which can negatively affect model performance and stability.

Leave-one-out cross-validation as a response

To address these limitations, alternative approaches were proposed. Most notably, Stone (1974) and Geisser (1975) independently introduced leave-one-out cross-validation (LOO).

In LOO, the model is estimated on N − 1 observations, and the remaining single observation is used for validation. This procedure is repeated until each observation has served once as the validation case. Because the training set in each iteration is almost as large as the full dataset, LOO allows more efficient use of limited data while preserving the logic of out-of-sample evaluation.
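A minimal sketch of LOO for an ordinary least-squares regression is shown below; the function name and the decision to return the out-of-sample errors are illustrative choices of mine.

```python
import numpy as np

def loo_residuals(X, y):
    """Leave-one-out cross-validation for a linear regression (sketch).

    Each observation is held out once, the model is refit on the
    remaining N - 1 cases, and the held-out case is predicted.
    Returns the vector of out-of-sample prediction errors.
    """
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        train = np.delete(np.arange(n), i)        # all cases except case i
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors[i] = y[i] - X[i] @ beta            # predict the held-out case
    return errors
```

Summing the squared errors returned by this function yields the classical PRESS statistic, the leave-one-out counterpart of a residual sum of squares.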

Why cross-validation is still not routine in psychology

Despite its long history, cross-validation has not become a default practice in psychology. Two reasons are mentioned repeatedly:

  1. Small samples, which make researchers reluctant to further reduce effective sample size by splitting the data.

  2. Workflow and software limitations, as many psychologists are trained in environments where cross-validation is not a standard option.


In the next posts, I will turn to contemporary methodological discussions and examine how cross-validation is currently conceptualized in psychology, including debates about predictive performance, model selection, and the role of resampling-based methods. I will also discuss concrete examples from empirical psychological research, illustrating how cross-validation is used in practice both as a tool for assessing generalization across samples and as a method for selecting models or tuning parameters within a single dataset.


References:

  • Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B, 36(2), 111–133.

  • Snee, R. D. (1977). Validation of regression models: methods and examples. Technometrics, 19(4), 415–428.

  • Hair, J. F. (2009). Multivariate Data Analysis (quoted passage from p. 203).

  • Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350), 320–328.
