Chapter 12 Custom distributions in Stan
Stan includes a large number of distributions, but what happens if we need a distribution that is not provided? In many cases, we can simply build a custom distribution by combining the ever-growing number of functions available in the Stan language.
12.1 A change of variables with the reciprocal normal distribution
In previous chapters, when faced with response times, we assumed a log-normal distribution. The log-normal distribution moves the inferences relating to the location parameter into a multiplicative frame (see Box 4.3). Another alternative, however, is to assume a “reciprocal”-normal distribution of response times. This is referred to hereafter as rec-normal in the text, and as the \(\mathit{RecNormal}\) distribution in equations. That is, we may want to assume that the reciprocal of the response times is normally distributed. This assumption may be theoretically motivated (Harris et al. 2014; Harris and Waddington 2012) or it may arise from the application of the Box-Cox variance stabilizing transform procedure (Box and Cox 1964). An example from psycholinguistics of a data analysis with a reciprocal transform of reading time data appears in Wu, Kaiser, and Vasishth (2017).
\[\begin{equation} \begin{aligned} 1/y &\sim \mathit{Normal}(\mu, \sigma) \\ y &\sim \mathit{RecNormal}(\mu, \sigma) \end{aligned} \tag{12.1} \end{equation}\]
An interesting aspect of the rec-normal is that it affords an interpretation of the location parameter in terms of rate or speed rather than time. Analogously to the case of the log-normal, neither the location \(\mu\) nor the scale \(\sigma\) is on the same scale as the dependent variable \(y\). These parameters are on the same scale as the transformed dependent variable (here, \(1/y\)) that is assumed to be normally distributed.
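To make the rate interpretation concrete, consider the following small illustration (our own example values, with \(\mu\) on the reciprocal-milliseconds scale):

mu <- 0.002 # rate of responding: responses per millisecond
mu * 1000   # the same rate expressed per second: 2 responses per second
1 / mu      # roughly the median response time in milliseconds: 500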
By setting \(\mu = 0.002\) and \(\sigma = 0.0004\), data can be generated from the rec-normal that looks right-skewed and not unlike a distribution of response times. The code below shows some summary statistics of the generated data.
N <- 100
mu <- 0.002
sigma <- 0.0004
rt <- 1 / rnorm(N, mu, sigma)
c(mean = mean(rt), sd = sd(rt), min = min(rt), max = max(rt))
## mean sd min max
## 527 122 319 1020
Figure 12.1 shows the distribution generated with the code shown above.
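A similar plot can be produced with a few lines of ggplot2 code (a minimal sketch of our own, not necessarily the code used to produce the figure):

library(ggplot2)
ggplot(data.frame(rt = rt), aes(x = rt)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey", colour = "black") +
  xlab("response times (ms)")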
We can fit the rec-normal distribution to the response times in a simple model with a normal likelihood, by storing the reciprocal of the response times (1/RT) in the vector variable recRT (rather than storing the “raw” response times):
model {
  // priors go here
  target += normal_lpdf(recRT | mu, sigma);
}
One issue here is that the parameters of the likelihood, mu and sigma, are going to be very far away from the unit scale (\(0.002\) and \(0.0004\), respectively). Due to the way Stan’s sampler is built, parameters that are too small (much smaller than 1) or too large (much larger than 1) can cause convergence problems. A straightforward solution is to fit the normal distribution to parameters on the reciprocal-seconds scale rather than the reciprocal-milliseconds scale; this makes the parameter values \(1000\) times larger (\(2\) and \(0.4\), respectively). We can do this using the transformed parameters block.46 Although we use mu and sigma to fit the data, the priors are defined on the parameters mu_s and sigma_s (\(\mu_s\) and \(\sigma_s\)).
Unless one can rely on previous estimates, finding good priors for \(\mu_s\) and \(\sigma_s\) is not trivial. To define appropriate priors, we would need to start with relatively arbitrary priors, inspect the prior predictive distributions, adjust the priors, and repeat the inspection of prior predictive distributions until these distributions start to look realistic. In the interest of conserving space, we skip this iterative process here, and assign the following priors:
\[\begin{equation} \begin{aligned} \mu_s & \sim \mathit{Normal}(2, 1)\\ \sigma_s & \sim \mathit{Normal}_+(0.4, 0.2) \end{aligned} \end{equation}\]
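Although we skip the iterative adjustment here, a minimal prior predictive simulation for these priors could look as follows (our own sketch; rtnorm() is assumed to come from the extraDistr package, as elsewhere in this chapter; negative predicted values can occur because the normal distribution is not truncated until later in the chapter):

library(extraDistr)
set.seed(123)
n_draws <- 1000
mu_s_tilde <- rnorm(n_draws, 2, 1)
sigma_s_tilde <- rtnorm(n_draws, 0.4, 0.2, a = 0)
# one predicted response time (in ms) per draw from the priors:
rt_pp <- 1 / rnorm(n_draws, mu_s_tilde / 1000, sigma_s_tilde / 1000)
# summary of the positive predicted response times:
quantile(rt_pp[rt_pp > 0], probs = c(0.025, 0.5, 0.975))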
This model is implemented in the file normal_recrt.stan, available in the bcogsci package:
data {
  int<lower = 1> N;
  vector[N] recRT;
}
parameters {
  real mu_s;
  real<lower = 0> sigma_s;
}
transformed parameters {
  real mu = mu_s / 1000;
  real sigma = sigma_s / 1000;
}
model {
  target += normal_lpdf(mu_s | 2, 1);
  target += normal_lpdf(sigma_s | 0.4, 0.2) -
    normal_lccdf(0 | 0.4, 0.2);
  target += normal_lpdf(recRT | mu, sigma);
}
Fit and display the summary of the previous model:
normal_recrt <- system.file("stan_models",
                            "normal_recrt.stan",
                            package = "bcogsci")
fit_rec <- stan(normal_recrt, data = list(N = N, recRT = 1 / rt))
print(fit_rec, pars = c("mu", "sigma"), digits = 4)
## mean 2.5% 97.5% n_eff Rhat
## mu 0.0020 0.0019 0.0021 3878 1
## sigma 0.0004 0.0004 0.0005 3687 1
Is a rec-normal likelihood more appropriate than a log-normal likelihood? As things stand, we cannot compare the models with these two likelihoods. This is because the dependent variables are different: we cannot compare reciprocal response times with untransformed response times on the millisecond scale. Model comparison with the Bayes factor or cross-validation can only compare models with the same dependent variables; see chapters 14-16.
If we do want to compare the reciprocal-normal likelihood and the log-normal likelihood, we have to set up the models with the two likelihoods in such a way that the dependent measure is on the raw millisecond scale in each model. This means that for the reciprocal normal likelihood, the model will receive as data raw reading times in milliseconds, and these will be treated as a transformed random variable from reciprocal reading times. This approach is discussed next, but requires knowledge of the Jacobian adjustment (see Ross (2002)).
12.1.1 Scaling a probability density with the Jacobian adjustment
To work with the original dependent variable (RT rather than 1/RT) we need to perform a change of variables: if the random variable \(X\) represents the reciprocal reading times (\(1/RT\)), we can transform this random variable to a new one, \(Y = 1/X = 1/(1/RT) = RT\), yielding a transformed random variable \(Y\) which represents reading time.
This change of variables requires an adjustment to the (unnormalized log) posterior probability to account for the distortion caused by the transform.47 The probability density must be scaled by a Jacobian adjustment, which, in a univariate case such as this one, is the absolute value of the derivative of the transform.48 The absolute value of any number is represented by enclosing it in two vertical bars: e.g., \(|-2| = 2\).
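In general, for a strictly monotonic and differentiable transform \(Y = g(X)\) with inverse \(X = g^{-1}(Y)\), the density of the transformed random variable is
\[\begin{equation} p_Y(y) = p_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right| \end{equation}\]
The first line of the equation below instantiates this formula with \(p_X\) the normal density and \(g^{-1}(\mathit{RT}) = 1/\mathit{RT}\).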
\[\begin{equation} \begin{aligned} p(\mathit{RT}_n | \mu, \sigma )& = \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) \left| \frac{d}{d\mathit{RT}_n} 1/\mathit{RT}_n \right| = \\ & \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) \cdot \left| - 1/\mathit{RT}_n^2 \right| = \\ & \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) \cdot 1/\mathit{RT}_n^2 \end{aligned} \end{equation}\]
If we omit the Jacobian adjustment, we are essentially fitting a different and unnormalized probability density function (this will be shown later in this chapter). The discrepancy between the correct and incorrect posterior would depend on how large the Jacobian adjustment for our model is.
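A quick numerical check in R makes this concrete (our own sketch, on the reciprocal-seconds scale, where \(\mu = 2\) and \(\sigma = 0.4\)): with the Jacobian adjustment, the rec-normal density integrates to (essentially) one; without it, the result is far from one.

# density of the rec-normal, including the Jacobian adjustment 1 / y^2:
drecnorm <- function(y, mu, sigma) dnorm(1 / y, mu, sigma) / y^2
integrate(drecnorm, lower = 0, upper = Inf, mu = 2, sigma = 0.4)$value
# close to 1 (the tiny deficit is the mass of the normal below zero)
# omitting the adjustment does not yield a proper density; over a
# generous range of response times (0 to 10 s) it integrates to a
# value far from 1 (roughly 0.3):
integrate(function(y) dnorm(1 / y, 2, 0.4), lower = 0, upper = 10)$value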
Because Stan works in log space, rather than multiplying by the Jacobian adjustment, we add its logarithm to the log-probability density (\(\log(1/\mathit{RT}_n^2) = - 2 \cdot \log(\mathit{RT}_n)\)). The contribution to the log likelihood (of \(\mu\) and \(\sigma\)), together with its adjustment, from an individual observation \(\mathit{RT}_{n}\) would be as follows.
\[\begin{equation} \log(\mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma)) - 2 \cdot \log(\mathit{RT}_n) \end{equation}\]
We obtain the log likelihood based on all the \(N\) observations by summing the log likelihood of individual observations.
\[\begin{equation} \log \mathcal{L} = \sum_{n=1}^N \log(\mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma)) - \sum_{n=1}^N 2 \cdot \log(\mathit{RT}_n) \end{equation}\]
In Stan, this summing up is done as follows. The function normal_lpdf applied to a vector already returns the sum of the individual log-probability densities. The Jacobian adjustment 2 * log(RT), which returns a vector of values, has to be summed up and added in manually (the term therefore becomes -sum(2 * log(RT))).
target += normal_lpdf(1 ./ RT | mu, sigma)
          - sum(2 * log(RT));
Before fitting the model with the change of variables, we are going to truncate the distribution. We didn’t encounter negative values in our synthetic data, but this was because the distribution was not too spread out; that is, the scale was much smaller than the location (\(\sigma \ll \mu\)). However, in principle we could end up generating negative values. For this reason, we truncate the underlying normal distribution (for more details, see Box 4.1). The reciprocal truncated normal distribution has been argued to be an appropriate model of response times, neural inter-spike intervals, and latency distributions of saccades in a simple optimality model in which reward is maximized to yield an optimal response rate (Harris et al. 2014; Harris and Waddington 2012).
Because we have \(N\) observations, the truncation consists of adding - N * normal_lccdf(0 | mu, sigma) to target; also see section 10.4.1. The complete model, recnormal_rt.stan, including the truncation, is as follows:
data {
  int<lower = 1> N;
  vector[N] RT;
}
parameters {
  real mu_s;
  real<lower = 0> sigma_s;
}
transformed parameters {
  real mu = mu_s / 1000;
  real sigma = sigma_s / 1000;
}
model {
  target += normal_lpdf(mu_s | 2, 1);
  target += normal_lpdf(sigma_s | 0.4, 0.2)
            - normal_lccdf(0 | 0.4, 0.2);
  target += normal_lpdf(1 ./ RT | mu, sigma)
            - N * normal_lccdf(0 | mu, sigma)
            - sum(2 * log(RT));
}
Next, generate data from a reciprocal truncated normal distribution:
N <- 500
mu <- .002
sigma <- .0004
rt <- 1 / rtnorm(N, mu, sigma, a = 0)
Fit the model to the data:
recnormal_rt <- system.file("stan_models",
                            "recnormal_rt.stan",
                            package = "bcogsci")
fit_rec <- stan(recnormal_rt, data = list(N = N, RT = rt))
Print the posterior summary:
print(fit_rec, pars = c("mu", "sigma"), digits = 4)
## mean 2.5% 97.5% n_eff Rhat
## mu 0.0020 0.0020 0.0020 3779 1
## sigma 0.0004 0.0004 0.0004 3522 1
We get the same results as before, but now we could potentially compare this model to one that assumes a log-normal likelihood, for example.
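Incidentally, the same logic underlies the log-normal likelihood used in earlier chapters: if \(\log(\mathit{RT})\) is assumed to be normally distributed, the absolute derivative of the transform is \(|\frac{d}{d\mathit{RT}} \log(\mathit{RT})| = 1/\mathit{RT}\), so the contribution of one observation on the log scale is
\[\begin{equation} \log(\mathit{Normal}(\log(\mathit{RT}_n) | \mu, \sigma)) - \log(\mathit{RT}_n) \end{equation}\]
which is precisely the log of the log-normal PDF that Stan’s lognormal_lpdf evaluates.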
An important question is the following: Does every transformation of a random variable require a Jacobian adjustment? Essentially, if we assign a distribution to a random variable and then transform it, this does not require a Jacobian adjustment.
Alternatively, if we apply a transformation to a random variable first, and then assign a distribution to the transformed random variable afterwards, we have a change of variables. This requires a Jacobian adjustment. This is the reason why 1 ./ RT requires a Jacobian adjustment, but not mu_s, mu, sigma_s, or sigma in the previous model.
As a last step, we can encapsulate our new distribution in a function, and also create a random number generator function. This is done in a block called functions. To create a function, we need to specify the type of every argument and the type that the function returns.
As a simple example, if we would like a function to center a vector, we would do it as follows:
functions {
  vector center(vector x) {
    vector[num_elements(x)] centered;
    centered = x - mean(x);
    return centered;
  }
}
data {
  ...
}
parameters {
  ...
}
model {
  ...
}
We want to create a log(PDF) function, similar to the native Stan functions that end in _lpdf. Our function will take as arguments a vector RT, and real numbers mu and sigma. In _lpdf functions, (some of) the arguments (e.g., RT, mu, and sigma) can be vectorized, but the output of the function is always a real number: the sum of the log(PDF) evaluated at every value of the random variable at the left of the pipe. As we show below, to do that we just move the right-hand side of target in our original model inside our new function. See the section entitled Further Reading at the end of this chapter for more about Stan functions.
For our custom random number generator function (which should end with _rng), we have the added complication that the distribution is truncated. We simply generate random values from a normal distribution and take the reciprocal; if the resulting number is above zero, it is returned; otherwise, a new number is drawn.
This implementation may be inefficient when rejections occur frequently. Another, more complex implementation using the inverse CDF approach can be found in Stan Development Team (2024) (section 21.10 of the User’s guide).
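For intuition, an R counterpart of this rejection sampler could look like the following (our own sketch; the Stan version appears in the functions block below):

recnormal_rng_r <- function(mu, sigma) {
  y <- 0
  # keep drawing until the reciprocal of the normal draw is positive:
  while (y <= 0) {
    y <- 1 / rnorm(1, mu, sigma)
  }
  y
}
recnormal_rng_r(0.002, 0.0004) # one simulated response time in ms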
The complete code can be found in recnormal_rt_f.stan, and is also shown below:
functions {
  real recnormal_lpdf(vector y, real mu, real sigma){
    real lpdf;
    lpdf = normal_lpdf(1 ./ y | mu, sigma)
           - num_elements(y) * normal_lccdf(0 | mu, sigma)
           - sum(2 * log(y));
    return lpdf;
  }
  real recnormal_rng(real mu, real sigma){
    real y = 0;
    while (y <= 0)
      y = 1 / normal_rng(mu, sigma);
    return y;
  }
}
data {
  int<lower = 1> N;
  vector<lower = 0>[N] RT;
}
parameters {
  real mu_s;
  real<lower = 0> sigma_s;
}
transformed parameters {
  real mu = mu_s / 1000;
  real sigma = sigma_s / 1000;
}
model {
  target += normal_lpdf(mu_s | 2, 1);
  target += normal_lpdf(sigma_s | 0.4, 0.2)
            - normal_lccdf(0 | 0.4, 0.2);
  target += recnormal_lpdf(RT | mu, sigma);
}
generated quantities {
  array[N] real rt_pred;
  for (n in 1:N)
    rt_pred[n] = recnormal_rng(mu, sigma);
}
Fit the model to the simulated data:
recnormal_rt_f <- system.file("stan_models",
                              "recnormal_rt_f.stan",
                              package = "bcogsci")
fit_rec_f <- stan(recnormal_rt_f, data = list(N = N, RT = rt))
Print the summary:
print(fit_rec_f, pars = c("mu", "sigma"), digits = 4)
## mean 2.5% 97.5% n_eff Rhat
## mu 0.0020 0.0020 0.0020 3643 0.999
## sigma 0.0004 0.0004 0.0004 3476 1.000
12.2 Validation of a computed posterior distribution
The model converged, but did it work? At the very least we expect that the simulated data that we used to test the model shows a very similar distribution to the posterior predictive distributions. This is because we are fitting the data with the same function that we use to generate the data. Figure 12.2 shows that the simulated data and the posterior predictive distributions are similar.
ppc_dens_overlay(rt, yrep = extract(fit_rec_f)$rt_pred[1:500, ]) +
  coord_cartesian(xlim = c(0, 2000))
Our synthetic data set was generated by first defining a ground truth vector of parameters \(\langle \mu; \sigma \rangle\). We would expect the true values of the parameters \(\mu\) and \(\sigma\) to be well inside the bulk of the posterior distribution of our model. We investigate this by plotting the true values of the parameters together with their posterior distributions using the function mcmc_recover_hist() of bayesplot, as shown below in Figure 12.3.
post_rec <- as.data.frame(fit_rec_f) %>%
  select(mu, sigma)
mcmc_recover_hist(post_rec, true = c(mu, sigma))
Even though this approach can capture serious misspecifications in our models, it has three important shortcomings.
First, it’s unclear what we should conclude if the true value of a parameter is not in the bulk of its posterior distribution. Even in a well-specified model, if we inspect enough parameters we will find that for some parameters, the bulks of their respective marginal posterior distributions can be relatively far away from their true values.
The second shortcoming is that if the posteriors that we obtain are very wide, it might not be clear what to consider the bulk of the posterior distribution that needs to be inspected.
The third shortcoming is that regardless of how a successful recovery of parameters is defined, we might be able to recover a posterior based on data generated from some parts of the parameter space, but not based on data generated from other parts of parameter space (Talts et al. 2018). However, inspecting the entire parameter space is infeasible.
An alternative to our approach of plotting the true values of the parameters together with their posterior distributions is simulation-based calibration (Talts et al. 2018; Modrák et al. 2023), which extends ideas of Cook, Gelman, and Rubin (2006). This is discussed next.
12.2.1 The simulation-based calibration procedure
Talts et al. (2018) suggest simulation-based calibration as a procedure for validating a model. We check that the Bayesian computation is faithful in the sense that it is not biased: our goal is to ensure that our model neither overestimates nor underestimates the parameters, nor estimates them with incorrect precision, given the data and priors. As initial steps, we employ two different implementations of the same statistical model: (steps 1-2) a generator capable of directly simulating draws from the prior predictive distribution, and (step 3) a probabilistic program that, when combined with a posterior approximation algorithm (such as a Stan sampler), samples from the posterior distribution. Finally, (steps 4-5) we verify that the results from both implementations have the same distribution conditional on the data.
- Generate true values (i.e., a ground truth) for the parameters of the model from the priors:
\[\begin{equation} \tilde{\Theta}_n \sim p(\boldsymbol{\Theta}) \end{equation}\]
Here, \(p(\boldsymbol{\Theta})\) represents the joint prior distribution of the vector of parameters. In our previous example, the vector of parameters is \(\langle \mu; \sigma \rangle\), and the joint prior distribution consists of two independent distributions, a normal and a truncated normal. Crucially, the prior distributions should be meaningful; that is, one should use priors that are at least regularizing or weakly informative. The prior space plays a crucial role because it determines the parameter space where we will verify the correctness or faithfulness of the model.
Assuming 150 samples (from the priors), this step for our model would be as follows:
N_sim <- 150
mu_tilde <- rnorm(N_sim, 2, 1) / 1000
sigma_tilde <- rtnorm(N_sim, 0.4, 0.2, a = 0) / 1000
- Generate multiple (N_sim) data sets based on the probability density function used as the likelihood:
\[\begin{equation} \tilde{D}_n \sim p(\boldsymbol{D} | \tilde{\Theta}_n) \end{equation}\]
The previous equation indicates that each data set \(\tilde{D}_n\) is sampled from each of the generated parameters, \(\tilde{\Theta}_n\). Following our previous example, use the reciprocal truncated normal distribution to generate data sets of response times:
## Create a placeholder for all the simulated datasets
rt_tilde <- vector(mode = "list", length = N_sim)
# Number of observations
N <- 500
for (n in 1:N_sim) {
  rt_tilde[[n]] <- 1 / rtnorm(N, mu_tilde[n], sigma_tilde[n], a = 0)
}
Now, rt_tilde consists of a list with 150 vectors of 500 observations each. Each list element represents a simulated data set, which corresponds to \(\tilde{D}_{n}\).
- Fit the same Bayesian model to each of the generated data sets.
After fitting the same model to each data set, we should compare the recovered posterior distribution for each parameter of each simulated data set; that is, \(p(\Theta_n | \tilde{D}_{n})\), with the parameters that were used to generate the data, \(\tilde{\Theta}_{n}\). This comparison is represented by ...
in the code below (that is, it is not yet implemented in the code shown below). This raises the question of how to compare the posterior distribution with the ground truth values of the parameters.
for (n in 1:N_sim) {
  fit <- stan(recnormal_rt,
              data = list(N = N, RT = rt_tilde[[n]]))
  ...
}
The procedure shown before (1-3) defines a natural condition for assessing whether the computed posterior distributions match the exact posterior distributions (Talts et al. 2018; Cook, Gelman, and Rubin 2006). This condition is met if the generator, the probabilistic program, and the posterior approximation algorithm fit together: the generator and the probabilistic program must accurately represent the same data-generating process, while the posterior approximation algorithm should yield samples that closely match the correct posterior for the probabilistic program, based on data simulated from the prior. Any discrepancy suggests a mismatch among the components (Modrák et al. 2023).49
The next two steps describe a well-defined comparison of the exact posterior and prior distribution (different calibration checking methods differ in how exactly they test for mismatches between the distributions; Modrák et al. 2023).
- Generate an ensemble of rank statistics of prior samples relative to corresponding posterior samples.
Talts et al. (2018) demonstrate that the match between the exact posterior and prior can be evaluated by examining whether, for each element of \(\boldsymbol{\Theta}\) (e.g., \(\langle \mu; \sigma \rangle\)), the rank statistics of the prior sample relative to \(S\) posterior samples are uniformly distributed across \([0, S]\). In other words, for each sample of the prior (e.g., \(\tilde{\mu}_{n}\)), we calculate the number of samples of the posterior that are smaller than the corresponding value of the sampled parameter, and we examine the distribution of these ranks across simulations. A mismatch between the exact posterior and our posterior would show up as a non-uniform distribution.
As a next step, we examine this new distribution visually using a histogram with \(S + 1\) possible ranks (from \(0\) to \(S\)). There are two issues that we need to take into account (Talts et al. 2018):
Regardless of our model, histograms will deviate from uniformity if the posterior samples are dependent. This can be solved by thinning the samples of the posterior, that is, removing a number of intermediate samples (Talts et al. 2018 recommend a thinning factor between 6 and 10).
To reduce the noise in the histogram, we should have bins of equal size. If \(S + 1\) is divisible by a large power of \(2\), e.g., \(1024\), we will be able to re-bin the histogram easily with bins of equal size. We complete the code shown in step 3 by generating an ensemble of rank statistics as well.
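To build intuition for why these ranks should be uniform, consider a toy conjugate model (our own illustration, unrelated to the rec-normal model) in which exact posterior samples are available in closed form, so that the only source of deviation from uniformity is sampling noise:

set.seed(123)
S <- 999 # posterior samples per simulation; ranks range over 0:999
ranks <- replicate(1000, {
  mu_tilde <- rnorm(1, 0, 1)  # step 1: draw from the prior
  y <- rnorm(20, mu_tilde, 1) # step 2: simulate data
  # step 3: exact conjugate posterior (sigma known to be 1):
  sd_post <- sqrt(1 / (1 + 20))
  mu_post <- sd_post^2 * sum(y)
  post <- rnorm(S, mu_post, sd_post)
  sum(post < mu_tilde)        # step 4: the rank statistic
})
hist(ranks) # approximately uniform over 0:999

Because the posterior draws here are exact and independent, no thinning is needed; with MCMC output, as in the function below, it is.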
Here we encapsulate steps 1 to 4 into a function.
gen_ranks <- function(...,
                      N_sim = 150,
                      N_s = 1024 * 6, # Total samples from the posterior
                      N = 500 # Number of observations
                      ) {
  mu_tilde <- rnorm(N_sim, 2, 1) / 1000
  sigma_tilde <- rtnorm(N_sim, 0.4, 0.2, a = 0) / 1000
  # Thin by 6, and remove one extra sample so that S + 1 = 1024
  thinner <- seq(from = 1, to = N_s, by = 6)[-1]
  # Placeholders
  rank_mu <- rep(NA, N_sim)
  rank_sigma <- rep(NA, N_sim)
  rt_tilde <- vector(mode = "list", length = N_sim)
  # Loop over the simulations
  for (n in 1:N_sim) {
    message("Fit number ", n)
    # it will repeat the sampling unless the model converged
    repeat {
      rt_tilde[[n]] <- 1 / rtnorm(N,
                                  mu_tilde[n],
                                  sigma_tilde[n],
                                  a = 0)
      # Fit is only stored temporarily
      fit <- stan(...,
                  data = list(N = N,
                              RT = rt_tilde[[n]]),
                  warmup = 2500,
                  iter = 2500 + N_s / 4,
                  refresh = -1,
                  control = list(adapt_delta = .999,
                                 max_treedepth = 14))
      # continue if there are no divergent transitions
      # and all Rhat are smaller than 1.05
      if (!get_num_divergent(fit) &
          all(monitor(fit)[, "Rhat"] <= 1.05)) break
      message("Repeating fit ", n)
    }
    post <- extract(fit)
    # Number of samples of the posterior that are smaller than
    # the corresponding value of the parameter sampled
    rank_mu[n] <- sum(post$mu[thinner] < mu_tilde[n])
    rank_sigma[n] <- sum(post$sigma[thinner] < sigma_tilde[n])
  }
  # data frame with ranks:
  tibble(sim = rep(1:N_sim, 2),
         variable = rep(c("mu", "sigma"),
                        each = N_sim),
         rank = c(rank_mu, rank_sigma))
}
set.seed(123)
# generate ranks with the default values:
df_rank <- gen_ranks(recnormal_rt)
- Graphically assess the correctness of the model using rank histograms.
As a last step, we use a histogram for each parameter to identify deviations from uniformity. If the posterior estimates are correct, then each of the \(B\) bins has a probability of \(1/B\) that a simulation (i.e., an individual rank) falls into it: \(\mathit{Binomial}(N_{sim}, 1/B)\). This allows us to complement the histograms with confidence intervals, indicating the variation expected from a uniform histogram.
Use the code below to build the plot shown in Figure 12.4. We can conclude that the implementation of our model (in the parameter space determined by the priors) is correct.
B <- 16
bin_size <- 1024 / 16
ci_l <- qbinom(0.005, size = N_sim, prob = 1 / B)
ci_u <- qbinom(0.995, size = N_sim, prob = 1 / B)
ggplot(df_rank, aes(x = rank)) +
  geom_histogram(breaks = seq(0, to = 1023, length.out = B + 1),
                 closed = "left", colour = "black") +
  scale_y_continuous("count") +
  geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
  facet_wrap(~variable, scales = "free_y")
Next, we consider an example where simulation-based calibration can reveal a problem.
12.2.2 An example where simulation-based calibration reveals a problem
Let’s assume that we made an error in the model implementation. We’ll fit an incorrect model: rather than the complete likelihood including the normal PDF and the truncation and Jacobian adjustments, we will fit only the normal PDF evaluated at the reciprocal of the response times, that is, target += normal_lpdf(1 ./ RT | mu, sigma);.
Assessing the correctness of the model by looking at the recovery of the parameters is misleading in this case, as is evident from Figure 12.5. One could conclude that the model is fine based on this plot, since the true values of the parameters are well inside the bulk of their posterior distributions.
However, the rank histograms produced by the simulation-based calibration procedure in Figure 12.6 show very tall bins at the left and at the right of each histogram, exceeding the 99% CI. This is a very clear indication that there is a problem with the model specification: our incorrect model is overestimating \(\mu\) and underestimating \(\sigma\). This example illustrates the importance of simulation-based calibration.
12.2.3 Issues with and limitations of simulation-based calibration
In the previous sections, we have used only rank histograms. Even though histograms are a very intuitive means of visualization for assessing model correctness, they might not be sensitive enough to detect small deviations (Talts et al. 2018). Other visualizations might be better suited. Another limitation of histograms is that they are sensitive to the number of bins (Säilynoja, Bürkner, and Vehtari 2022). One could bin the histograms multiple times, but this approach is difficult to interpret when there are many parameters, and can be vulnerable to multiple testing biases (Talts et al. 2018). One alternative to rank histograms is visualizations based on the empirical cumulative distribution function of the ranks. This is briefly presented in Box 12.1.
Box 12.1 Different rank visualizations and the SBC package.
Implementing the simulation-based calibration algorithm “by hand” introduces a new source of potential errors. Fortunately, the R package SBC (Kim et al. 2024) provides tools to validate a Stan model (or any sampling algorithm) by allowing us to run simulation-based calibrations easily.
The package is in active development at the moment50 and can be installed with the following command.
remotes::install_github("hyunjimoon/SBC")
One of the main advantages of this package is that it provides several ways to visualize the results of the simulation-based calibration procedure; see https://hyunjimoon.github.io/SBC/. Figure 12.7 shows rank histograms produced by SBC for a correct model and several different incorrect models. An alternative to rank histograms is to use an empirical cumulative distribution function (ECDF)-based method, as proposed by Säilynoja, Bürkner, and Vehtari (2022). The idea behind this method is that if the ranks produced by the simulation-based calibration algorithm are uniform, the ECDF of the ranks should be close to the CDF of a uniform distribution. Figure 12.8 shows the difference between the ECDF of the ranks and the CDF of a uniform distribution, together with 95% confidence bands (this is the default in the SBC package), for a correct model and different incorrect ones.
Even though simulation-based calibration is a comprehensive approach to test model correctness, it has several drawbacks, regardless of the means of visualization:
This approach requires fitting a model many times, which can be too time intensive for complex models. It’s worth bearing in mind that code with errors can sometimes show clear “catastrophic” failures in the recovery of the parameters. For this reason, one could start the verification of model correctness with a simple recovery of parameters (see Talts et al. 2018; Gelman et al. 2020), and then proceed with the simulation-based calibration procedure for at least some suspect aspects of the model (e.g., when working with a custom likelihood).
Another major limitation of simulation-based calibration is that it is concerned exclusively with the computational aspects of the analysis and offers no guarantee for any single observation. A complete Bayesian workflow should include prior and posterior predictive checks, see chapter 7 and Schad, Betancourt, and Vasishth (2020).
Finally, it’s important to apply simulation-based calibration only after appropriate priors are set. This is important because only the priors determine the parameter space that will be inspected. Weak priors that may be good enough for estimation may nevertheless lead to a parameter space that is unrealistically large since, during the calibration stage, there are no independent data to constrain the space. This means that it is important to conduct prior predictive checks before assessing the correctness of the model with simulation-based calibration.
12.3 Another custom distribution: The exponential distribution implemented manually
There are cases when one needs a random variable with a distribution that is not included in Stan. Many times one can find the PDF (or a CDF) derived in a paper, and one needs to implement it manually by writing the log PDF in the Stan language. Even though the exponential distribution is included in Stan, we demonstrate how we would include it step-by-step as if it weren’t available. This example extends what is demonstrated in Stan Development Team (2024) (section 22.1 of the User’s guide).
The exponential distribution is used to model waiting times until a certain event occurs. Its PDF is the following:
\[\begin{equation} f(x | \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0,\\ 0 & x < 0. \end{cases} \end{equation}\]
The parameter \(\lambda\) is often called the rate parameter and must be positive. A higher value for the rate parameter leads to shorter waiting times on average. The mean of the distribution is \(1/\lambda\). The exponential distribution has the key property of being memoryless. What this means is explained in Ross (2002) as follows: Suppose that \(X\) is a random variable with an exponential distribution as a PDF; the random variable represents the lifetime of an item (e.g., \(X\) could represent the time it takes for a radioactive particle to completely decay). If the item is \(t\) time-units old, then the remaining life \(s\) of that item has the same probability distribution as the life of a new item. Mathematically, this amounts to stating that
\[\begin{equation} P(X>s + t | X> t)= P(X>s) \end{equation}\]
The implication of the memoryless property is that we do not need to know the age of an item to know what the distribution of its remaining life is.
To give another example of memorylessness, the conditional probability that a certain event will happen in the next 100 ms is the same regardless of whether we have been already waiting 1000 ms, 10 ms, or 0 ms. Although the exponential distribution is not commonly used for modeling response times in cognitive science, it has been used in the past (Ashby and Townsend 1980; Ashby 1982). We focus on this distribution because of its simple analytical form.
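A quick simulation of this last example makes the memoryless property concrete (our own sketch, with a mean waiting time of \(1/\lambda = 500\) ms):

set.seed(123)
x <- rexp(1e6, rate = 1 / 500)
s <- 100
t <- 1000
mean(x[x > t] > s + t) # estimate of P(X > s + t | X > t)
mean(x > s)            # estimate of P(X > s)
# both are close to exp(-s / 500), approximately 0.82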
As a first step, we make our own version of the PDF in R, and then check that it integrates to 1 to verify that this is actually a proper distribution (and that we haven’t introduced a typo in the formula).
To avoid underflow, that is, getting a zero instead of a very small number, we’ll work on the log scale. This will also be useful when we implement this function in Stan. The log PDF is
\[\begin{equation} \log(f(x| \lambda ))= \log(\lambda) -\lambda x \end{equation}\]
where \(x >0\).
Implement this function in R with the same arguments as the d* family of functions: if log = TRUE, the output is a log density; call this new function dexp2.
dexp2 <- function(x, lambda = 1, log = FALSE) {
  log_density <- log(lambda) - lambda * x
  if (log == FALSE) {
    exp(log_density)
  } else {
    log_density
  }
}
Verify that this function integrates to 1 for some point values of its parameter \(\lambda\) (here, \(\lambda = 1\) and \(\lambda = 20\)):
dexp2_l1 <- function(x) dexp2(x, 1)
integrate(dexp2_l1, lower = 0, upper = Inf)$value
## [1] 1
dexp2_l20 <- function(x) dexp2(x, 20)
integrate(dexp2_l20, lower = 0, upper = Inf)$value
## [1] 1
To test our function, we’ll also need to generate random values from the exponential distribution. If the quantile function of the distribution exists, inverse transform sampling is a relatively straightforward way to generate pseudo-random numbers from a target distribution (for an accessible introduction to inverse transform sampling, see Lynch 2007). Given a target distribution with a PDF \(f\) and a quantile function \(F^{-1}\) (the inverse of the CDF), the inverse transform sampling method consists of the following:
- Sample one number \(u\) from \(\mathit{Uniform}(0,1)\).
- Then \(z=F^{-1}(u)\) is a draw from \(f(x)\).
In this case, the quantile function (the inverse of the CDF) is the following:
\[\begin{equation} - \log(1-p)/\lambda \end{equation}\]
Here is how one can derive this inverse of the CDF. First, consider the fact that the CDF of the exponential distribution is as follows. The term \(q\) is some quantile of the distribution:
\[\begin{equation} F(q) = \int_0^q \lambda \exp(-\lambda x)\, dx \end{equation}\]
We can solve this integral by using the \(u\)-substitution method (Salas, Etgen, and Hille 2003). First, define
\[\begin{equation} u = g(x) = -\lambda x \end{equation}\]
Then, the derivative \(du/dx\) is:
\[\begin{equation} \frac{du}{dx} = -\lambda \end{equation}\]
This implies that \(du = - \lambda dx\), or that \(- du = \lambda dx\).
In the CDF, replace the term \(\lambda dx\) with \(-du\) and rewrite \(-\lambda x\) as \(u\); since the integration variable is now \(u\), the limits become \(u(0) = 0\) and \(u(q) = -\lambda q\):
\[\begin{equation} F(q) = \int_0^q \lambda \exp(-\lambda x)\, dx = \int_0^{-\lambda q} -\exp(u)\, du \end{equation}\]
We know from calculus that the integral of \(-\exp(u)\) is \(-\exp(u)\). So, the integral becomes:
\[\begin{equation} F(q) = \left[ -\exp(u) \right]_0^{-\lambda q} \end{equation}\]
Evaluating at the limits, we get:
\[\begin{equation} F(q) = -\exp(-\lambda q) + \exp(0) = 1 - \exp(-\lambda q) \end{equation}\]
Thus, we know that the CDF is \(F(q) = 1 - \exp(-\lambda q) = p\), where \(p\) is the probability of observing the quantile \(q\) or some value smaller than \(q\). To derive the inverse of the CDF, solve the equation below for \(q\):
\[\begin{equation} \begin{aligned} p &= 1- e^{-\lambda q}\\ p-1 &= - e^{-\lambda q}\\ -p+1 &= e^{-\lambda q}\\ \log(1-p) &= -\lambda q\\ - \log(1-p)/\lambda &= q\\ \end{aligned} \end{equation}\]
Write the quantile function (the inverse of the CDF) and the random number generator for the exponential distribution in R. To differentiate it from the built-in function in R, we will call it qexp2:
qexp2 <- function(p, lambda = 1) {
  -log(1 - p) / lambda
}
rexp2 <- function(n, lambda = 1) {
  u <- runif(n, 0, 1)
  qexp2(u, lambda)
}
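As a quick sanity check (our own addition), the empirical quantiles of draws from rexp2() should match R’s built-in quantile function qexp():

set.seed(123)
samples <- rexp2(100000, lambda = 1 / 200)
quantile(samples, c(0.25, 0.5, 0.75))
qexp(c(0.25, 0.5, 0.75), rate = 1 / 200)
# both should be close to 57.5, 138.6, and 277.3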
The functions that we would use in a Stan model are relatively faithful to the R code, but follow Stan conventions: the expression log1m(p) is more arithmetically stable than log(1 - p) for values of p close to one; the function exp_lpdf stores the sum of the log(PDF) evaluated at each value of x, which would be analogous to doing sum(dexp2(x, lambda, log = TRUE)); the function exp_rng implements a non-vectorized version of rexp2 and uses the auxiliary function exp_icdf (icdf stands for inverse CDF), which is similar to qexp2.
functions {
  real exp_lpdf(vector x, real lambda){
    vector[num_elements(x)] lpdf = log(lambda) - lambda * x;
    return sum(lpdf);
  }
  real exp_icdf(real p, real lambda){
    return -log1m(p) / lambda;
  }
  real exp_rng(real lambda){
    real u = uniform_rng(0, 1);
    return exp_icdf(u, lambda);
  }
}
We are now ready to generate synthetic data and fit the distribution in Stan. Generate 1000 observations.
N <- 1000
lambda <- 1 / 200
rt <- rexp2(N, lambda)
Use exponential.stan, which includes the function block shown earlier, and the following obligatory blocks:
data {
  int<lower = 1> N;
  vector<lower = 0>[N] RT;
}
parameters {
  real<lower = 0> lambda;
}
model {
  target += normal_lpdf(lambda | 0, .1) -
    normal_lccdf(0 | 0, .1);
  target += exp_lpdf(RT | lambda);
}
generated quantities {
  array[N] real rt_pred;
  for (n in 1:N)
    rt_pred[n] = exp_rng(lambda);
}
Fit the data with Stan.
exponential <- system.file("stan_models",
                           "exponential.stan",
                           package = "bcogsci")
fit_exp <- stan(exponential, data = list(N = N, RT = rt))
Print the summary:
print(fit_exp, pars = c("lambda"))
## mean 2.5% 97.5% n_eff Rhat
## lambda 0.01 0 0.01 1471 1
Carry out a quick check first, verifying that the true value of the parameter \(\lambda\) is reasonably inside the bulk of the posterior distribution. This is shown in Figure 12.9(a). Figure 12.9(b) shows the results of the simulation-based calibration procedure, and shows that our implementation was correct.
# Plot a
post_exp <- as.data.frame(fit_exp) %>%
  select("lambda")
mcmc_recover_hist(post_exp, true = lambda) +
  # make it similar to plot b:
  ggtitle("a") +
  xlab("lambda") +
  theme(plot.title = element_text(hjust = 0),
        legend.position = "none",
        strip.text.x = element_blank())
# Plot b
B <- 16
bin_size <- 1024 / 16
N_sim <- 200
ci_l <- qbinom(0.005, size = N_sim, prob = 1 / B)
ci_u <- qbinom(0.995, size = N_sim, prob = 1 / B)
ggplot(df_rank_lambda, aes(x = rank)) +
  geom_histogram(breaks = seq(0, to = 1023, length.out = B + 1),
                 closed = "left", colour = "black") +
  scale_y_continuous("count") +
  geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
  ggtitle("b") +
  theme(plot.title = element_text(hjust = 0))
12.4 Summary
In this chapter, we learned how to create a PDF that is not provided by Stan. We learned how to do this by doing a change of variables (with its Jacobian adjustment) and by building the distribution from scratch in a function ending in _lpdf. We also learned how to verify the correctness of the new functions by recovering the true values of the parameters and by using simulation-based calibration.
12.5 Further reading
Jacobian adjustments in Bayesian models are an ongoing source of confusion and there are several posts in blogs and study cases that try to shed light on them:
- https://rstudio-pubs-static.s3.amazonaws.com/486816_440106f76c944734a7d4c84761e37388.html
- https://betanalpha.github.io/assets/case_studies/probability_theory.html#42_probability_density_functions
- https://jsocolar.github.io/jacobians/
- https://mc-stan.org/documentation/case-studies/mle-params.html
Jacobian adjustments are also relevant for adjusting priors for order constraints, which is especially important for Bayes factors (Heck and Wagenmakers 2016).
A complete tutorial of simulation-based calibration using the SBC package was presented at the online event StanConnect 2021, and it is available at https://www.martinmodrak.cz/post/2021-sbc_tutorial/. The ideas that predate this technique can be found in Cook, Gelman, and Rubin (2006) and were extended in Talts et al. (2018). The use of ECDF-based visualizations is discussed in Säilynoja, Bürkner, and Vehtari (2022). The role of simulation-based calibration in the Bayesian workflow is discussed in Schad, Betancourt, and Vasishth (2020). Examples of the use of simulation-based calibration to validate novel models used in cognitive science are Hartmann, Johannsen, and Klauer (2020) and Bürkner and Charpentier (2020). The extension of this procedure to validate Bayes factors is discussed in Schad et al. (2021).
Custom functions in general and custom probability functions are treated in Stan Development Team (2024) (chapter 21 of the User’s guide).
12.6 Exercises
Exercise 12.1 Fitting a shifted log-normal distribution.
A random variable \(Y\) has a shifted log-normal distribution with shift \(\psi\), location \(\mu\), and scale \(\sigma\), if \(Z = Y-\psi\) and \(Z \sim \mathit{LogNormal}(\mu,\sigma)\).
- Implement a shifted_lognormal_lpdf function in Stan with three parameters: mu, sigma, and psi. Tip: One can use the regular log-normal distribution and apply a change of variables. In this case the Jacobian adjustment would be \(|\frac{d}{dY} (Y - \psi)| = 1\), which in log-space is conveniently zero.
- Verify the correctness of the model by recovering the true values (your choice) of the parameters of the model and by using simulation-based calibration. In order to use simulation-based calibration, you will need to decide on sensible priors; assume that \(\psi \sim \mathit{Normal}_+(100,50)\), and choose priors for \(\mu\) and \(\sigma\) so that the prior predictive distributions are adequate for response times.
Exercise 12.2 Fitting a Wald distribution.
The Wald distribution (or inverse Gaussian distribution) and its variants have been proposed as another useful distribution for response times (see for example Heathcote 2004).
The probability density function of the Wald distribution is the following.
\[\begin{equation} f(x; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi x^{3}}} \exp\left(-\frac{\lambda (x - \mu)^{2}}{2\mu^{2} x}\right) \end{equation}\]
- Implement this distribution in Stan as wald_lpdf. In order to do this, you will need to derive the logarithm of the PDF presented above. You can adapt the code of the following R function.
dwald <- function(x, lambda, mu, log = FALSE) {
  log_density <- 0.5 * log(lambda / (2 * pi)) -
    1.5 * log(x) -
    0.5 * lambda * ((x - mu) / (mu * sqrt(x)))^2
  if (log == FALSE) {
    exp(log_density)
  } else {
    log_density
  }
}
- Verify the correctness of the model by recovering the true values (your choice) of the parameters of the model and by using simulation-based calibration. As with the previous exercise, you will need to decide on sensible priors by deriving prior predictive distributions that are adequate for response times.
References
Adams, Robert A., and Christopher Essex. 2018. Calculus: A Complete Course. Pearson.
Ashby, F. Gregory. 1982. “Testing the Assumptions of Exponential, Additive Reaction Time Models.” Memory & Cognition 10 (2): 125–34.
Ashby, F. Gregory, and James T. Townsend. 1980. “Decomposing the Reaction Time Distribution: Pure Insertion and Selective Influence Revisited.” Journal of Mathematical Psychology 21 (2): 93–123.
Box, George E. P., and David R. Cox. 1964. “An Analysis of Transformations.” Journal of the Royal Statistical Society. Series B (Methodological), 211–52.
Bürkner, Paul-Christian, and Emmanuel Charpentier. 2020. “Modelling Monotonic Effects of Ordinal Predictors in Bayesian Regression Models.” British Journal of Mathematical and Statistical Psychology. https://doi.org/10.1111/bmsp.12195.
Cook, Samantha R., Andrew Gelman, and Donald B. Rubin. 2006. “Validation of Software for Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical Statistics 15 (3): 675–92. https://doi.org/10.1198/106186006X136976.
Gelman, Andrew, Aki Vehtari, Daniel P. Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” arXiv Preprint arXiv:2011.01808.
Harris, Christopher M., and Jonathan Waddington. 2012. “On the Convergence of Time Interval Moments: Caveat Sciscitator.” Journal of Neuroscience Methods 205 (2): 345–56. https://doi.org/10.1016/j.jneumeth.2012.01.017.
Harris, Christopher M., Jonathan Waddington, Valerio Biscione, and Sean Manzi. 2014. “Manual Choice Reaction Times in the Rate-Domain.” Frontiers in Human Neuroscience 8: 418. https://doi.org/10.3389/fnhum.2014.00418.
Hartmann, Raphael, Lea Johannsen, and Karl Christoph Klauer. 2020. “rtmpt: An R Package for Fitting Response-Time Extended Multinomial Processing Tree Models.” Behavior Research Methods 52 (3): 1313–38.
Heathcote, Andrew. 2004. “Fitting Wald and ex-Wald Distributions to Response Time Data: An Example Using Functions for the S-Plus Package.” Behavior Research Methods, Instruments, & Computers 36 (4): 678–94. https://doi.org/10.3758/BF03206550.
Heck, Daniel W., and Eric-Jan Wagenmakers. 2016. “Adjusted Priors for Bayes Factors Involving Reparameterized Order Constraints.” Journal of Mathematical Psychology 73: 110–16. https://doi.org/10.1016/j.jmp.2016.05.004.
Kim, Shinyoung, Hyunji Moon, Martin Modrák, and Teemu Säilynoja. 2024. SBC: Simulation Based Calibration for Rstan/Cmdstanr Models. https://hyunjimoon.github.io/SBC/.
Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. New York, NY: Springer.
Modrák, Martin, Angie H. Moon, Shinyoung Kim, Paul-Christian Bürkner, Niko Huurre, Kateřina Faltejsková, Andrew Gelman, and Aki Vehtari. 2023. “Simulation-Based Calibration Checking for Bayesian Computation: The Choice of Test Quantities Shapes Sensitivity.” Bayesian Analysis, 1–28. https://doi.org/10.1214/23-BA1404.
Ross, Sheldon. 2002. A First Course in Probability. Pearson Education.
Säilynoja, Teemu, Paul-Christian Bürkner, and Aki Vehtari. 2022. “Graphical Test for Discrete Uniformity and Its Applications in Goodness-of-Fit Evaluation and Multiple Sample Comparison.” Statistics and Computing 32 (2): 1–21. https://doi.org/10.1007/s11222-022-10090-6.
Salas, Saturnino L., Garret J. Etgen, and Einar Hille. 2003. Calculus: One and Several Variables. Ninth. John Wiley & Sons.
Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2019. “Toward a Principled Bayesian Workflow in Cognitive Science.” arXiv Preprint. https://doi.org/10.48550/ARXIV.1904.12765.
Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1): 103–26. https://doi.org/10.1037/met0000275.
Schad, Daniel J., Bruno Nicenboim, Paul-Christian Bürkner, Michael J. Betancourt, and Shravan Vasishth. 2021. “Workflow Techniques for the Robust Use of Bayes Factors.” arXiv Preprint arXiv:2103.08744.
Stan Development Team. 2024. “Stan Modeling Language Users Guide and Reference Manual, Version 2.32.” https://mc-stan.org/docs/2_35/.
Talts, Sean, Michael J. Betancourt, Daniel P. Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint arXiv:1804.06788.
Wu, Fuyun, Elsi Kaiser, and Shravan Vasishth. 2017. “Effects of Early Cues on the Processing of Chinese Relative Clauses: Evidence for Experience-Based Theories.” Cognitive Science 42: 1101–33.
Yao, Yuling, Aki Vehtari, Daniel P. Simpson, and Andrew Gelman. 2018. “Yes, but Did It Work?: Evaluating Variational Inference.” In International Conference on Machine Learning, 5581–90. PMLR.
46. In this specific case, we could also transform the dependent variable first, but as will become clear later, we want to avoid changing our dependent variable.
47. Not every transformation is valid: univariate changes of variables must be monotonic and differentiable, and multivariate changes of variables must be surjective and differentiable.
48. In the multivariate case, it is equal to the absolute determinant of the Jacobian of the transform, the matrix of all its first-order partial derivatives; see section 6.7 of Ross (2002), and p. 707 of Adams and Essex (2018).
49. We are working under the assumption that Stan, which yields the posterior approximation, works correctly. In principle, if we assume that our model is correct, we can also use simulation-based calibration to examine whether our approximation to the posterior distribution is correct (that is, whether Stan’s sampler, or any posterior approximation method, works correctly); for an example, see Yao et al. (2018).
50. Even though the package is already fully functional, function names and arguments might change by the time this book is published.