Chapter 14 Introduction to model comparison
A key goal of cognitive science is to decide which of the theories under consideration accounts for the experimental data best. This can be accomplished by implementing the theories (or some aspects of them) as Bayesian models and comparing their predictive power. Thus, model comparison and hypothesis testing are closely related ideas. There are two Bayesian perspectives on model comparison: a prior predictive perspective based on the Bayes factor using marginal likelihoods, and a posterior predictive perspective based on cross-validation. The main difference between the prior predictive approach (Bayes factor) and the posterior predictive approach (cross-validation) is the following: the Bayes factor examines how well the model (prior and likelihood) explains the experimental data. By contrast, the posterior predictive approach assesses model predictions for held-out data after the model has seen most of the data.
14.1 Prior predictive vs. posterior predictive model comparison
The predictive accuracy of the Bayes factor is based only on the prior predictive distribution: in Bayes factor analyses, the prior model predictions are used to evaluate the support that the data give to the model. By contrast, in cross-validation, the model is fit to a large subset of the data (i.e., the training data). The posterior distributions of the parameters of this fitted model are then used to make predictions for held-out or validation data, and model fit is assessed on this held-out subset. Typically, this process is repeated several times, until all the subsets of the entire data set have been assessed as held-out data. This approach attempts to assess whether the model will generalize to truly new, unobserved data. Of course, the held-out data is usually not “truly new” because it is part of the data that was collected, but at least it is data that the model has not been exposed to. That is, the predictive accuracy of cross-validation methods is based on how well the posterior predictive distribution of a model fit to most of the data (i.e., the training data) characterizes out-of-sample data (i.e., the test or held-out data). Notice that one could in principle use the posterior predictive approach with truly new data, by simply repeating the experiment with new subjects and then treating the new data as held-out data.
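To make this concrete, here is a minimal sketch of how k-fold cross-validation could be set up with brms; the data frame df (with columns n400, c_cloze, and subj) and all settings are placeholders rather than an analysis from this book.

library(brms)
# Fit the model of interest and a simpler competitor; df is assumed to
# contain the columns n400, c_cloze, and subj.
fit_full <- brm(n400 ~ 1 + c_cloze + (1 + c_cloze | subj), data = df)
fit_null <- brm(n400 ~ 1 + (1 + c_cloze | subj), data = df)
# Ten-fold cross-validation: each fold is held out in turn and predicted
# from the posterior of a model re-fit to the remaining nine folds.
kf_full <- kfold(fit_full, K = 10)
kf_null <- kfold(fit_null, K = 10)
# Compare the expected log predictive density (elpd) on the held-out folds;
# higher elpd indicates better out-of-sample predictions.
loo_compare(kf_full, kf_null)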
Coming back to Bayes factors, the prior predictive distribution is obviously highly sensitive to the priors: it evaluates the probability of the observed data under prior assumptions. By contrast, the posterior predictive distribution is less dependent on the priors because the priors are combined with the likelihood (and are thus less influential, given sufficient data) before making predictions for held-out validation data.
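The following sketch illustrates the point about prior sensitivity: the same model compared against the same null model can yield rather different Bayes factors under different priors on the effect of interest. The data frame df, the priors, and the settings are all illustrative, and in practice Bayes factor computation needs many more posterior samples than the brms defaults (this is discussed in the next chapter).

library(brms)
# Two versions of the same model that differ only in the prior on the slope,
# plus a null model; df, the priors, and all settings are illustrative.
fit_narrow <- brm(n400 ~ 1 + c_cloze + (1 + c_cloze | subj), data = df,
                  prior = set_prior("normal(0, 1)", class = "b", coef = "c_cloze"),
                  save_pars = save_pars(all = TRUE))
fit_wide <- brm(n400 ~ 1 + c_cloze + (1 + c_cloze | subj), data = df,
                prior = set_prior("normal(0, 100)", class = "b", coef = "c_cloze"),
                save_pars = save_pars(all = TRUE))
fit_null <- brm(n400 ~ 1 + (1 + c_cloze | subj), data = df,
                save_pars = save_pars(all = TRUE))
# The two Bayes factors against the same null model will typically differ,
# because the marginal likelihood integrates the likelihood over the prior.
bayes_factor(fit_narrow, fit_null)
bayes_factor(fit_wide, fit_null)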
Jaynes (2003, Chapter 20) compares these two perspectives to “a cruel realist” and “a fair judge”. According to Jaynes, Bayes factor adopts the posture of a cruel realist, who “judge[s] each model taking into account the prior information we actually have pertaining to it; that is, we penalize a model if we do not have the best possible prior information about its parameters, although that is not really a fault of the model itself.” By contrast, cross-validation adopts the posture of a scrupulously fair judge, “who insists that fairness in comparing models requires that each is delivering the best performance of which it is capable, by giving each the best possible prior probability for its parameters (similarly, in Olympic games we might consider it unfair to judge two athletes by their performance when one of them is sick or injured; the fair judge might prefer to compare them when both are doing their absolute best).”
14.2 Some important points to consider when comparing models
Regardless of whether we use the Bayes factor, cross-validation, or any other method for model comparison, there are several important points that one should keep in mind:
Although the objective of model comparison might ultimately be to find out which of the models under consideration generalizes better, this generalization can only be done well within the range of the observed data (see Vehtari and Lampinen 2002; Vehtari and Ojanen 2012). That is, if one hypothesis, implemented as the model \(\mathcal{M}_1\), turns out to be superior to a second hypothesis, implemented as the model \(\mathcal{M}_2\), according to the Bayes factor and/or cross-validation when evaluated with a young western university student population, this does not mean that \(\mathcal{M}_1\) will be superior to \(\mathcal{M}_2\) when evaluated with a broader population (and in fact it often won't be; see Henrich, Heine, and Norenzayan 2010). However, if we can't generalize even within the range of the observed data (e.g., university students in the northern part of the western hemisphere), there is no hope of generalizing outside of that range (e.g., non-university students). Navarro (2019) argues that one of the most important functions of a model is to encourage directed exploration of new territory; our view is that this makes sense only if historical data can also be accounted for. In practice, what that means for us is that a model's performance should be evaluated using historical benchmark data in addition to any new data one has; just using isolated pockets of new data to evaluate a model is not convincing. For examples from psycholinguistics of model evaluation using historical benchmark data, see Nicenboim, Vasishth, and Rösler (2020) and Yadav et al. (2023).
Model comparison can provide a quantitative way to evaluate models, but this cannot replace understanding the qualitative patterns in the data (see, e.g., Navarro 2019). A model can provide a good fit by behaving in a way that contradicts our substantive knowledge. For example, Lissón et al. (2021) examine two computational models of sentence comprehension. One of the models yielded higher predictive accuracy when the parameter that is related to the probability of correctly comprehending a sentence was higher for impaired subjects (individuals with aphasia) than for the control population. This contradicts domain knowledge—impaired subjects are generally observed to show worse performance than unimpaired control subjects—and led to a re-evaluation of the model.
Model comparison is based on finding the most “useful model” for characterizing our data, but neither the Bayes factor nor cross-validation (nor any other method that we are aware of) guarantees selecting the model closest to the truth (even with enough data). This is related to the previous point: a model that is closest to the true data-generating process is not guaranteed to produce the best (prior or posterior) predictions, and a model with a clearly wrong data-generating process is not guaranteed to produce poor (prior or posterior) predictions. See Wang and Gelman (2014) for an example with cross-validation, and Navarro (2019) for a toy example with Bayes factors.
One should also check how precisely the effect of interest is estimated from the data being modeled; if the effect has high uncertainty (i.e., the posterior distribution of the target parameter is widely spread out), then any measure of model fit can be uninformative, because we don't have an accurate estimate of the effect of interest. In the Bayesian context, this implies that the posterior predictive distributions of the effects generated by the model should be theoretically plausible and reasonably constrained, and the target parameter of interest should be estimated with as much precision as possible (a rough check is sketched below); this in turn means that we need sufficient data if we want to obtain precise estimates of the parameter of interest. What counts as sufficient will depend on the topic being studied.51 Later in this part of the book, we will discuss the adverse impact of imprecision in the data on model comparison (see section 15.5.2). We will show that, in the face of low precision, we generally won't learn much from model comparison.
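As a rough check of precision, one can inspect the width of the credible interval of the target parameter before doing any model comparison; the following minimal sketch assumes a fitted brms model named fit_full with a population-level effect named b_c_cloze (both names are placeholders).

library(brms)
library(posterior)
# Posterior draws of the target parameter (object and parameter names are
# placeholders for whatever model is being evaluated).
beta <- as_draws_df(fit_full)$b_c_cloze
# The width of the 95% credible interval is a rough index of precision:
quantile(beta, probs = c(0.025, 0.975))
diff(quantile(beta, probs = c(0.025, 0.975)))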
When comparing a null model with an alternative model, it is important to be clear about what the null model specification is. For example, in section 5.2.4, we encountered the correlated varying intercepts and varying slopes model for the effect of cloze probability on the N400. The brms formula for the full model was:
n400 ~ 1 + c_cloze + (1 + c_cloze | subj)
The formal statement of this model is:
\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta+ u_{subj[n],2}),\sigma) \end{equation}\]
If we want to test the null hypothesis that centered cloze has no effect on the dependent variable, one null model is:
n400 ~ 1 + (1 + c_cloze | subj) (Model M0a)
Formally, this would be stated as follows (the \(\beta\) term is removed as it is assumed to be \(0\)):
\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot u_{subj[n],2},\sigma) \end{equation}\]
In model M0a, by-subject variability is allowed; just the population-level (or fixed) effect of centered cloze is assumed to be zero. This is called a nested model comparison, because the null model is subsumed in the full model.
An alternative null model could remove only the varying slopes:
n400 ~ 1 + c_cloze + (1 | subj) (Model M0b)
Formally:
\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot \beta,\sigma) \end{equation}\]
Model M0b, which is also nested inside the full model, can be used to test a different null hypothesis than M0a above: is the between-subject variability in the centered cloze effect zero?
Yet another possibility is to remove both the population-level and group-level (or random) effects of centered cloze:
n400 ~ 1 + (1 | subj) (Model M0c)
Formally:
\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1},\sigma) \end{equation}\]
Model M0c is also nested inside the full model, but it now has two parameters missing instead of one: \(\beta\) and \(u_{subj[n],2}\). Usually, it is best to compare models by removing one parameter at a time; otherwise, one cannot be sure which parameter was responsible for the rejection or acceptance of the null hypothesis.
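For concreteness, the full model and the three null models above can be written down as brms formulas and then fit and compared one at a time; the sketch below assumes a data frame df with columns n400, c_cloze, and subj, and the comparison call shown (approximate leave-one-out cross-validation) is just one of the options discussed in the next two chapters.

library(brms)
# The full model and the three nested null models discussed above.
f_full <- bf(n400 ~ 1 + c_cloze + (1 + c_cloze | subj))
f_M0a  <- bf(n400 ~ 1 + (1 + c_cloze | subj))   # population-level slope removed
f_M0b  <- bf(n400 ~ 1 + c_cloze + (1 | subj))   # by-subject varying slopes removed
f_M0c  <- bf(n400 ~ 1 + (1 | subj))             # both removed
# Fit each model with brm() and compare it to the full model, removing one
# term at a time; for example:
fit_full <- brm(f_full, data = df)
fit_M0a  <- brm(f_M0a, data = df)
loo_compare(loo(fit_full), loo(fit_M0a))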
Box 14.1 Credible intervals should not be used to reject a null hypothesis
Researchers often incorrectly use credible intervals for null hypothesis testing, that is, to test whether a parameter \(\beta\) is zero or not. A common approach is to check whether zero is included in the 95% credible interval for the parameter \(\beta\): if it is, the null hypothesis that the effect is zero is accepted, and if zero is outside the interval, the null is rejected. For example, in a tutorial paper that two of the authors of this book wrote (Nicenboim and Vasishth 2016), we incorrectly suggested that the credible interval can be used to reject the hypothesis that \(\beta\) is zero. This is generally not the correct approach. The problem is that this approach is a heuristic: it will work in some cases and will be misleading in others (for an example, see Vasishth, Yadav, et al. 2022), and it is not well-defined when it will work and when it won't.
Why is the credible-interval approach only a heuristic? One line of (generally incorrect) reasoning that justifies looking at the overlap between credible intervals and zero is based on the fact that the most likely values of \(\beta\) lie within the 95% credible interval.52 This entails that if zero is outside the interval, it must have a low probability density. This is true, but it is also uninformative: regardless of where zero (or any other point value) lies, it will have a probability mass of exactly zero, since we are dealing with a continuous distribution. The lack of overlap therefore does not tell us how much posterior probability the null model has.
A partial solution could be to look at the probability of an interval close to zero rather than at the point value zero (e.g., an interval of, say, \(-2\) to \(2\) ms in a response time experiment), so that we obtain a non-zero probability mass. While the lack of overlap would then be slightly more informative, excluding a small interval can be problematic when the prior probability mass of that interval is very small to begin with (as was the case with the regularizing priors we assigned to our parameters). Rouder, Haaf, and Vandekerckhove (2018) show that if prior probability mass is added to the point value zero using a spike-and-slab prior (or to the small interval close to zero, if one considers that equivalent to the null model), then looking at whether zero is in the 95% credible interval is analogous to the Bayes factor. Unfortunately, the spike-and-slab prior cannot be implemented directly in Stan, because it relies on a discrete parameter. However, other programming tools (like PyMC3, JAGS, or Turing) can be used if such a prior needs to be fit; see the further readings at the end of the chapter.
Rather than looking at the overlap of the 95% credible interval with zero, we might be tempted to conclude that there is evidence for an effect because the probability that the parameter is positive is high, that is, \(P(\beta > 0) \gg 0.5\). However, the same logic as in the previous paragraph applies here. Given that the probability mass at a point value, \(P(\beta = 0)\), is zero, all we can conclude from \(P(\beta > 0) \gg 0.5\) is that \(\beta\) is very likely to be positive rather than negative; we cannot make any assertions about whether \(\beta\) is exactly zero.
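For illustration, both quantities discussed above (the posterior mass in a small interval around zero and the posterior probability that the effect is positive) can be computed directly from the posterior draws; the fit object fit_full, the parameter name b_c_cloze, and the interval bounds are assumptions.

library(brms)
library(posterior)
# Posterior draws of the slope from a hypothetical fit of the full model.
beta <- as_draws_df(fit_full)$b_c_cloze
# Posterior mass in a small interval around zero (here +/- 2 units, illustrative):
mean(beta > -2 & beta < 2)
# Posterior probability that the effect is positive:
mean(beta > 0)
# Neither quantity is the posterior probability of the null hypothesis itself.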
As we saw, the main problem with these heuristics is that they ignore the fact that the null model is a separate hypothesis. In many situations, the null hypothesis may not be of interest, and it might be perfectly fine to base our conclusions on credible intervals or on \(P(\beta > 0)\). The problem arises when these heuristics are used to provide evidence in favor of or against the null hypothesis. If one wants to argue about the evidence in favor of or against a null hypothesis, Bayes factors or cross-validation will be needed; these are discussed in the next two chapters.
How can credible intervals be used sensibly? The region of practical equivalence (ROPE) approach (Spiegelhalter, Freedman, and Parmar 1994; Freedman, Lowe, and Macaskill 1984; and, more recently, Kruschke and Liddell 2018; Kruschke 2014) is a reasonable alternative to hypothesis testing and arguing for or against a null. This approach is related to the spike-and-slab discussion above. In the ROPE approach, one can define a range of values for a target parameter that is predicted before the data are seen. Of course, there has to be a principled justification for choosing this range a priori; an example of a principled justification would be the prior predictions of a computational model. Then, the overlap (or lack thereof) between this predicted range and the observed credible interval can be used to infer whether one has estimates consistent (or partly consistent) with the predicted range. Here, we are not ruling out any null hypothesis, and we are not using the credible interval to make a decision like “the null hypothesis is true/false.”
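A minimal sketch of such a ROPE-style check follows, where the predicted range, the fit object fit_full, and the parameter name b_c_cloze are all placeholders that would have to be justified a priori.

library(brms)
library(posterior)
# An a-priori predicted range for the effect (purely illustrative numbers).
rope <- c(2, 6)
beta <- as_draws_df(fit_full)$b_c_cloze
# Proportion of the posterior that falls inside the predicted range:
mean(beta > rope[1] & beta < rope[2])
# Or compare the 95% credible interval with the predicted range directly:
quantile(beta, probs = c(0.025, 0.975))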
There is one situation where credible intervals could arguably be used to carry out a null hypothesis test. When priors are flat, credible intervals can have approximately correct frequentist coverage properties, making it reasonable to check whether zero falls within the credible interval. For example, Newall et al. (2023) use credible intervals as confidence intervals after calibration: they explicitly verify that 5% of the 95% credible intervals exclude zero when no effect exists. When using such an approach, this kind of verification step is necessary. We don't discuss this approach any further, because our aim in this part of the book is not to derive frequentist statistics from Bayesian analysis, but to use Bayesian methods to obtain posterior probabilities and Bayes factors, focusing on Bayesian hypothesis testing.
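To make the calibration idea concrete, one could repeatedly simulate data with a true effect of zero and check how often the 95% credible interval excludes zero; the following is a schematic (and computationally heavy) sketch with purely illustrative settings, not the procedure used by Newall et al. (2023).

library(brms)
set.seed(123)
n_sim <- 100   # number of simulated data sets (illustrative)
n_obs <- 50    # observations per data set (illustrative)
excludes_zero <- logical(n_sim)
# Compile the model once on one simulated null data set (true slope = 0),
# using a very wide (near-flat) prior on the slope.
d <- data.frame(y = rnorm(n_obs), x = rnorm(n_obs))
fit <- brm(y ~ x, data = d,
           prior = set_prior("normal(0, 100)", class = "b"),
           refresh = 0)
for (i in seq_len(n_sim)) {
  d_new <- data.frame(y = rnorm(n_obs), x = rnorm(n_obs))
  fit_i <- update(fit, newdata = d_new, refresh = 0)
  ci <- fixef(fit_i)["x", c("Q2.5", "Q97.5")]
  excludes_zero[i] <- ci["Q2.5"] > 0 | ci["Q97.5"] < 0
}
# With near-flat priors, roughly 5% of the 95% credible intervals should
# exclude zero when the true effect is zero.
mean(excludes_zero)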
14.3 Further reading
Roberts and Pashler (2000) and Pitt and Myung (2002) argue for the need to go beyond “a good fit” (in the context of Bayesian data analysis, this corresponds to a good posterior predictive check), and for model comparison with a focus on measuring the generalizability of a model. Navarro (2019) deals with the problematic aspects of model selection in the context of the psychological literature and cognitive modeling. Fabian Dablander’s blog post, https://fabiandablander.com/r/Law-of-Practice.html, shows a very clear comparison between the Bayes factor and PSIS-LOO-CV. Rodriguez, Williams, and Rast (2022) provide JAGS code for fitting models with spike-and-slab priors. Fabian Dablander has a comprehensive blog post on how to implement a Gibbs sampler in R when using such a prior: https://fabiandablander.com/r/Spike-and-Slab.html. Yadav et al. (2023) use 17 different data sets for model comparison using cross-validation, holding out each data set successively; this is an example of evaluating the predictive performance of a model on truly new data.
References
Bever, Thomas G. 1970. “The Cognitive Basis for Linguistic Structures.” Cognition and the Development of Language.
Freedman, Laurence S., D. Lowe, and P. Macaskill. 1984. “Stopping Rules for Clinical Trials Incorporating Clinical Opinion.” Biometrics 40 (3): 575–86.
Henrich, Joseph, Steven J. Heine, and Ara Norenzayan. 2010. “The Weirdest People in the World?” Behavioral and Brain Sciences 33 (2-3): 61–83. https://doi.org/10.1017/S0140525X0999152X.
Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University Press.
Kruschke, John K. 2014. Doing Bayesian Data Analysis: A tutorial with R, JAGS, and Stan. Academic Press.
Kruschke, John K., and Torrin M. Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin & Review 25 (1): 178–206. https://doi.org/10.3758/s13423-016-1221-4.
Lissón, Paula, Dorothea Pregla, Bruno Nicenboim, Dario Paape, Mick van het Nederend, Frank Burchert, Nicole Stadie, David Caplan, and Shravan Vasishth. 2021. “A Computational Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia.” Cognitive Science 45 (4): e12956. https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.12956.
Newall, Philip W. S., Taylor R. Hayes, Henrik Singmann, Leonardo Weiss-Cohen, Elliot A. Ludvig, and Lukasz Walasek. 2023. “Evaluation of the ’Take Time to Think’ Safer Gambling Message: A Randomised, Online Experimental Study.” Behavioural Public Policy, 1–18. https://doi.org/10.1017/bpp.2023.2.
Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical methods for linguistic research: Foundational Ideas - Part II.” Language and Linguistics Compass 10 (11): 591–613. https://doi.org/10.1111/lnc3.12207.
Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020. “Are Words Pre-Activated Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian Random-Effects Meta-Analysis Using Publicly Available Data.” Neuropsychologia 142. https://doi.org/10.1016/j.neuropsychologia.2020.107427.
Paape, Dario, Garrett Smith, and Shravan Vasishth. 2024. “Do Local Coherence Effects Exist in English Reduced Relative Clauses?” https://doi.org/10.31219/osf.io/wpke4.
Pitt, Mark A., and In Jae Myung. 2002. “When a Good Fit Can Be Bad.” Trends in Cognitive Sciences 6 (10): 421–25. https://doi.org/10.1016/S1364-6613(02)01964-2.
Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on Theory Testing.” Psychological Review 107 (2): 358–67. https://doi.org/10.1037/0033-295X.107.2.358.
Rodriguez, Josue E., Donald R. Williams, and Philippe Rast. 2022. “Who Is and Is Not ‘Average’? Random Effects Selection with Spike-and-Slab Priors.” Psychological Methods. https://doi.org/10.1037/met0000535.
Rouder, Jeffrey N., Julia M. Haaf, and Joachim Vandekerckhove. 2018. “Bayesian Inference for Psychology, Part IV: Parameter Estimation and Bayes Factors.” Psychonomic Bulletin & Review 25 (1): 102–13. https://doi.org/10.3758/s13423-017-1420-7.
Spiegelhalter, David J., Laurence S. Freedman, and Mahesh K. B. Parmar. 1994. “Bayesian Approaches to Randomized Trials.” Journal of the Royal Statistical Society. Series A (Statistics in Society) 157 (3): 357–416.
Tabor, Whitney, Bruno Galantucci, and Daniel Richardson. 2004. “Effects of Merely Local Syntactic Coherence on Sentence Processing.” Journal of Memory and Language 50: 355–70.
Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.” Computational Brain and Behavior. https://doi.org/10.1007/s42113-021-00125-y.
Vehtari, Aki, and Jouko Lampinen. 2002. “Bayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities.” Neural Computation 14 (10): 2439–68. https://doi.org/10.1162/08997660260293292.
Vehtari, Aki, and Janne Ojanen. 2012. “A Survey of Bayesian Predictive Methods for Model Assessment, Selection and Comparison.” Statistical Surveys 6 (0): 142–228. https://doi.org/10.1214/12-ss102.
Wang, Wei, and Andrew Gelman. 2014. “Difficulty of Selecting Among Multilevel Models Using Predictive Accuracy.” Statistics and Its Interface 7: 1–8. https://doi.org/10.4310/SII.2015.v8.n2.a3.
Yadav, Himanshu, Garrett Smith, Sebastian Reich, and Shravan Vasishth. 2023. “Number Feature Distortion Modulates Cue-Based Retrieval in Reading.” Journal of Memory and Language 129. https://doi.org/10.1016/j.jml.2022.104400.
As an example from psycholinguistics, strong garden-path effects like those elicited by “The horse (that was) raced past the barn fell” (Bever 1970) may be easy to detect with high precision with a relatively small number of subjects, but subtle effects such as local coherence (Tabor, Galantucci, and Richardson 2004) will probably require a much larger sample size to detect the effect with high precision (Paape, Smith, and Vasishth 2024).↩︎
This is also strictly true only for a highest density interval (HDI), that is, a credible interval where all the points within the interval have a higher probability density than any point outside the interval. However, when posterior distributions are symmetrical, these intervals are virtually identical to the equal-tail intervals we use in this book.↩︎