B Advanced models with Stan - Extended

B.1 What does target do in Stan models?

We can exemplify how target works with one hypothetical iteration of the sampler in the model normal.stan discussed in section 8.2 and shown below:

data {
  int<lower = 1> N;  // Total number of trials
  vector[N] y;  // Score in each trial
}
parameters {
  real mu;
  real<lower = 0> sigma;
}
model {
  // Priors:
  target += normal_lpdf(mu | 0, 20);
  target += lognormal_lpdf(sigma | 3, 1);
  // Likelihood:
  target += normal_lpdf(y | mu, sigma);
}

In every iteration in which the sampler explores the posterior space, mu and sigma take on different values (these are the positions at which Stan's Hamiltonian dynamics stop the movement of the particle in that iteration). Say that in one iteration, mu = 4.115 and sigma = 10.794. Then the following happens in the model block:

  1. At the beginning of the iteration, target is zero.
  2. The transformations that the sampler automatically applies are taken into account. In our case, although sigma is constrained to be positive in our model, inside Stan’s sampler it is transformed to an unconstrained space amenable to Hamiltonian Monte Carlo. That is, Stan samples from an auxiliary parameter that ranges from minus infinity to infinity and is equivalent to log(sigma). This auxiliary parameter is then exponentiated when it is incorporated into our model. Because of the mismatch between the constrained parameter space that we defined and the unconstrained space that Stan converts it to, an adjustment to the unnormalized posterior is required and is added automatically. The reasons for this requirement are somewhat complex and are discussed in section 10. In this particular case, the adjustment (the log absolute value of the Jacobian determinant of the transformation) is equivalent to adding log(sigma) = 2.379 to target.
  3. After target += normal_lpdf(mu | 0, 20); the log of the density of \(\mathit{Normal}(0,20)\) is evaluated at a given sample of mu (specifically 4.115) and this is added to target. In R, this would be dnorm(x = 4.115, mean = 0, sd = 20, log = TRUE), which is equal to -3.936. Thus, target should be -3.936 + 2.379 = -1.557.
  4. After target += lognormal_lpdf(sigma | 3, 1);, we add the log of the density of \(\mathit{LogNormal}(3, 1)\) evaluated at 10.794 to the previous value of the target. In R, this would be dlnorm(x = 10.794, meanlog = 3, sdlog = 1, log = TRUE), which is equal to -3.491. Thus, target should be updated to -1.557 + -3.491 = -5.048.
  5. After target += normal_lpdf(y | mu, sigma);, we add to the target the sum of the log densities of \(\mathit{Normal}(4.115, 10.794)\) evaluated at each of the values of y. In R, this would be to add sum(dnorm(y, 4.115, 10.794, log = TRUE)) (which is equal to -368.997) to the current value of target: -5.048 + -368.997 = -374.045.

This means that for the coordinates [mu = 4.115, sigma = 10.794], the height of the unnormalized posterior would be the value exp(target) = \(\exp( -374.045 ) = 3.584\times 10^{-163}\). Incidentally, the value of target is returned as lp__ (log probability) in an object storing a fitted Stan model.
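
The bookkeeping described above can be reproduced directly in R. The sketch below assumes that the vector y holds the N observed scores; its final value corresponds to target (and thus to lp__) for this pair of parameter values.

mu <- 4.115
sigma <- 10.794
target <- 0
# Adjustment for the log-transform of the constrained parameter sigma:
target <- target + log(sigma)
# Priors:
target <- target + dnorm(mu, mean = 0, sd = 20, log = TRUE)
target <- target + dlnorm(sigma, meanlog = 3, sdlog = 1, log = TRUE)
# Likelihood:
target <- target + sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
target  # approximately -374.045 for the data used above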

It is possible to expose the value of target by printing target() inside a Stan model; as mentioned above, the value of target at the end of each iteration is stored as lp__ in the Stan fit object. This can be useful for troubleshooting a problematic model.

B.2 Explicitly incrementing the log probability function (target) vs. using the sampling or distribution ~ notation

In this book, we specify priors and likelihoods by explicitly incrementing the log-probability function using the following syntax:

target += pdf_name_lpdf(parameter | ...)

However, Stan also allows for specifying priors and likelihoods with the so-called sampling or distribution statement notation:

parameter ~ pdf_name(...)

Confusingly enough, a sampling statement does not perform any actual sampling; it is meant to be a notational convenience.

The following two lines of code lead to the same behavior in Stan with respect to parameter estimation. There is, nonetheless, an important difference between them.

x ~ normal(mu, sigma);
target += normal_lpdf(x | mu, sigma);

The important difference is that the sampling notation (the notation with the \(\sim\)) will drop normalizing constants. Consider the following formula that corresponds to the log-transformed PDF of a normal distribution:

\[\begin{equation} -log(\sigma) - \frac{log(2 \pi)}{2} - \frac{(x-\mu)^2}{2 \sigma^2} \end{equation}\]

If one uses the sampling notation, Stan will ignore the terms that don’t contain parameters, such as \(- \frac{log(2 \pi)}{2}\). Depending on whether the variable \(x\), the location \(\mu\), and the scale \(\sigma\) are data or parameters, Stan will ignore different terms. For example, consider the case of a linear regression. The data \(y\) (taking the role of \(x\) in the previous equation) is assumed to be normally distributed with a location (\(\mu\)) and scale (\(\sigma\)) to be estimated. In this case, only \(-\frac{log(2 \pi)}{2}\) can be dropped, because both \(-log(\sigma)\) and \(- \frac{(y-\mu)^2}{2 \sigma^2}\) contain parameters. Another example where different terms would be dropped is the case of assigning a normal prior distribution to a parameter \(\theta\). Here, the location and scale (\(\mu\) and \(\sigma\)) are data and \(\theta\) takes the role of \(x\) in the previous equation and acts as a parameter. This means that \(-log(\sigma)\) is a constant term that can be ignored, but not \(- \frac{(\theta-\mu)^2}{2 \sigma^2}\), because it contains the parameter \(\theta\). Dropping constant terms does not affect parameter estimation because it shifts the unnormalized log density by the same constant everywhere in the parameter space. To make this more concrete, the whole plot in Figure 8.1 will move up or down by some constant amount, and this won’t affect the Hamiltonian dynamics that determine how we sample from the posterior.
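
To see the effect of dropping constants concretely, the following R sketch (not part of any Stan model) compares the full log density of a \(\mathit{Normal}(\mu, 1)\) distribution evaluated at a fixed data point with the kernel that contains only the \(\mu\)-dependent term; the two differ by the same constant for every value of \(\mu\).

x <- 2
sigma <- 1
mu_grid <- seq(-5, 5, by = 0.1)
# Full log density (all terms):
full <- dnorm(x, mean = mu_grid, sd = sigma, log = TRUE)
# Kernel only (the term that contains mu):
kernel <- -(x - mu_grid)^2 / (2 * sigma^2)
# The difference is the same constant everywhere in the parameter space:
unique(round(full - kernel, 10))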

The advantage of the sampling notation is that it can be faster (when many terms are ignored), but the disadvantages are that (i) it is not compatible with the calculation of Bayes factors with bridge sampling (see section 13.4 in chapter 13) or with the calculation of the log-likelihood for cross-validation (see chapter 14), (ii) it misleads us into thinking that Stan is actually sampling the left-hand term of the sampling statement (e.g., drawing \(y\) from a normal distribution in the previous example), when in fact at each step the log probability (target) is incremented based on the parameter values determined by the Hamiltonian dynamics (as explained before), and (iii) it makes it less straightforward to transition to more complex models where the sampling notation cannot be used (as in, for example, the mixture models of chapter 17).

If one is not going to use Bayes factors with bridge sampling or cross-validation, the same speed advantage as the sampling notation can be achieved by incrementing the log probability with log-unnormalized probability density or mass functions (functions ending in _lupdf or _lupmf). The previous example would be translated into the following:

target += normal_lupdf(y | mu, sigma);

B.3 An alternative R interface to Stan: cmdstanr

At the time of writing, there are two major nuisances with rstan: (i) the R code interfaces directly with C++, creating installation problems on many systems, and (ii) rstan releases lag considerably behind releases of the Stan language, preventing the user from taking advantage of the latest features of Stan. The package cmdstanr (https://mc-stan.org/cmdstanr/) is a lightweight interface to Stan for R that solves these problems. The downside (at the moment of writing this) is that, being lightweight, some functionality of rstan is lost, such as looking up functions with lookup(), or using the fitted model with the bridgesampling package to generate Bayes factors. Furthermore, the package cmdstanr is currently under development and the application programming interface (API) might still change. However, the user interested in an easy (and painless) installation and the latest features of Stan might find it useful.

Once cmdstanr is installed, we can use it as follows:

First, create a new CmdStanModel object from a file containing a Stan program using cmdstan_model():

normal <- system.file("stan_models",
                      "normal.stan",
                      package = "bcogsci")
normal_mod <- cmdstan_model(normal)

The object normal_mod is an R6 reference object (https://r6.r-lib.org/). This class of object behaves similarly to objects in object-oriented programming languages such as Python. Methods are accessed using $ (rather than . as in Python).
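
For instance, CmdStanModel objects have a print() method; calling it with the $ syntax should display the Stan program stored in the object:

normal_mod$print()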

To sample, use the $sample() method. The data argument accepts a list (as we used in stan() from rstan). However, many of the arguments of $sample() have different names than the ones used in stan() from the rstan package:

lst_score_data <- list(y = y, N = length(y))
fit_normal_cmd <- normal_mod$sample(data = lst_score_data,
                                    seed = 123,
                                    chains = 4,
                                    parallel_chains = 4,
                                    iter_warmup = 1000,
                                    iter_sampling = 1000)

To show the posterior summary, access the method $summary() of the object fit_normal_cmd.

fit_normal_cmd$summary()

Access the samples of fit_normal_cmd using $draws().

fit_normal_cmd$draws(variables = "mu")

The vignettes at https://mc-stan.org/cmdstanr/ show more use cases, including how the samples can be transformed into other formats (data frame, matrix, etc.) with the package posterior (https://mc-stan.org/posterior/).
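
As a brief sketch of one such conversion (assuming the posterior package is installed), the draws can be turned into a data frame or a matrix:

library(posterior)
# All post-warmup draws as a data frame (one row per draw):
draws_df <- as_draws_df(fit_normal_cmd$draws())
head(draws_df)
# Draws of selected variables as a matrix:
draws_mat <- as_draws_matrix(fit_normal_cmd$draws(variables = c("mu", "sigma")))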

B.4 Matrix, vector, or array in Stan?

Stan contains three basic linear algebra types: vector, row_vector, and matrix. But Stan also allows for building arrays of any dimension from any type of element (integer, real, etc.). This means that there are several ways to define one-dimensional N-sized containers of real numbers,

array[N] real a;
vector[N] a;
row_vector[N] a;

as well as, two-dimensional N1\(\times\)N2-sized containers of real numbers:

array[N1, N2] real m;
matrix[N1, N2] m;
array[N1] vector[N2] b;
array[N1] row_vector[N2] b;

These distinctions affect what we can do with these variables and, in some cases, the speed of our model; sometimes the different types are interchangeable. Matrix algebra is defined only for (row) vectors and matrices; that is, we cannot multiply arrays. The following line requires all the one-dimensional containers (p_size and c_load) to be defined as vectors of the same size (or all as row vectors):

p_size = alpha + c_load * beta;

Many “vectorized” operations are also valid for arrays; that is, normal_lpdf accepts (row) vectors (as we did in our code) or arrays, as in the next example. There is of course no point in converting a vector to an array as follows, but this shows that Stan allows both types of one-dimensional containers.

array[N] real mu = to_array_1d(alpha + c_load * beta);
target += normal_lpdf(p_size | mu, sigma);

By contrast, the outcome of “vectorized” pseudorandom number generator (_rng) functions can only be stored in an array. The following example shows the only way to vectorize this type of function:

array[N] real p_size_pred = normal_rng(alpha + c_load * beta,
                                       sigma);

Alternatively, one can always use a for-loop, and it won’t matter if p_size_pred is an array or a vector:

vector[N] p_size_pred;
for(n in 1:N)
    p_size_pred[n] = normal_rng(alpha + c_load[n] * beta, sigma);

See also the section on matrices, vectors, and arrays in Stan’s user’s guide (Stan Development Team 2024, Chapter 18 of the User’s guide).

B.5 A simple non-centered parameterization

Stan’s sampler can explore the parameter space more easily if its step size is appropriate for all the parameters. This is achieved when there are no strong dependencies between parameters. In section 9.1.2, we fit an uncorrelated varying intercept and slopes model with Stan, and we assume the following.

\[\begin{equation} \mathbf{u}_{2} \sim \mathit{Normal}(0, \tau_{u_2}) \end{equation}\]

where \(\mathbf{u}_{2}\) is the column vector of \(u_{i,2}\)’s. The index \(i\) refers to the subject id.

We can transform a vector \(v\) into \(z\)-scores as follows.

\[\begin{equation} \mathbf{z}_{v} =\frac{\mathbf{v} - mean(\mathbf{v})}{SD(\mathbf{v})} \end{equation}\]

Transforming the parameter \(u_2\) into \(z\)-scores amounts to centering it, so we can call this a centered parameterization.

\[\begin{equation} \mathbf{z}_{u_2} =\frac{\mathbf{u}_{2} - 0}{\tau_{u_2}} \end{equation}\]

where \[\begin{equation} \mathbf{z}_{u_2} \sim \mathit{Normal}(0, 1) \end{equation}\]

Now \(\mathbf{z}_{u_2}\) is easier to sample because it doesn’t depend on other parameters (in particular, it is no longer conditional on \(\tau_{u_2}\)) and its scale is \(1\). Once we have sampled this centered (\(z\)-scored) parameter, we can derive the actual parameter we care about by carrying out the inverse operation, which is called a non-centered parameterization:

\[\begin{equation} \mathbf{u}_{2} = \mathbf{z}_{u_2} \cdot \tau_{u_2} \end{equation}\]
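
The equivalence can be illustrated with a short R simulation (a sketch only; in a Stan model, z_u2 would be declared as a parameter and u_2 computed in the transformed parameters block): sampling the auxiliary parameter from a standard normal distribution and scaling it by \(\tau_{u_2}\) yields draws with the same distribution as sampling \(\mathbf{u}_{2}\) directly from \(\mathit{Normal}(0, \tau_{u_2})\).

set.seed(123)
tau_u2 <- 0.5
# Auxiliary parameter: easy to sample, scale 1, no dependency on tau_u2:
z_u2 <- rnorm(1e5, mean = 0, sd = 1)
# Inverse operation recovers the parameter of interest:
u_2 <- z_u2 * tau_u2
c(mean(u_2), sd(u_2))  # approximately 0 and tau_u2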

A question that might be raised here is whether using a non-centered parameterization is always a good idea. Betancourt and Girolami (2015) point out that the extremeness of the dependency depends on the amount of data, and the efficacy of the parameterization depends on the strength of the data (on how informative the data is). When there is enough data, this parameterization is unnecessary and it may be more efficient to use the centered parameterization. However, cases where there is enough data to render this parameterization useless might also be cases where the partial pooling of the hierarchical models may not be needed in the first place. Although data from conventional lab experiments in psychology, psycholinguistics, and related areas seem to benefit from the non-centered parameterization, the jury is still out for larger data sets with thousands of subjects from crowdsourcing websites.

B.6 Cholesky factorization for reparameterizing hierarchical models with correlations between adjustments to different parameters

First, some definitions that we will need below. A matrix is square if the number of rows and columns is identical. A square matrix \(A\) is symmetric if \(A^T = A\), i.e., if transposing the matrix gives the matrix back. Suppose that \(A\) is a known square matrix of real numbers. If \(\boldsymbol{x}\) is a vector of variables with length \(p\) (a \(p\times 1\) matrix), then \(\boldsymbol{x}^T A \boldsymbol{x}\) is called a quadratic form in \(\boldsymbol{x}\) (\(\boldsymbol{x}^T A \boldsymbol{x}\) will be a scalar, \(1\times 1\)). If \(\boldsymbol{x}^T A \boldsymbol{x} > 0\) for all \(\boldsymbol{x} \neq 0\), then \(A\) is a positive definite matrix. If \(\boldsymbol{x}^T A \boldsymbol{x} \geq 0\) for all \(\boldsymbol{x}\), then \(A\) is positive semi-definite.
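
As a small R illustration (with an arbitrarily chosen matrix and vector), we can evaluate the quadratic form and check the positive definiteness of a \(2\times 2\) correlation matrix directly:

A <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)  # a correlation matrix
x <- c(-1.3, 2.7)  # an arbitrary non-zero vector
t(x) %*% A %*% x  # the quadratic form; a positive scalar (1 x 1 matrix)
all(eigen(A)$values > 0)  # TRUE: A is positive definite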

We encountered correlation matrices first in section 1.6.3. A correlation matrix is always symmetric, has ones along the diagonal, and real values ranging between \(-1\) and \(1\) on the off-diagonals. Given a correlation matrix \(\mathbf{R_u}\), we can decompose it into a lower triangular matrix \(\mathbf{L_u}\) such that \(\mathbf{L_u}\mathbf{L_u}^T=\mathbf{R_u}\). The matrix \(\mathbf{L_u}\) is called the Cholesky factor of \(\mathbf{R_u}\). Intuitively, you can think of \(\mathbf{L_u}\) as the matrix equivalent of the square root of \(\mathbf{R_u}\). More details on the Cholesky factorization can be found in Gentle (2007).

For the two-dimensional case that we use below, the Cholesky factor is a lower triangular matrix of the form

\[\begin{equation} \mathbf{L_u} = {\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}} \end{equation}\]

For a model without a correlation between adjustments for the intercept and slope, such as the one in section 9.1.2, we assumed that the adjustments \(u_{1}\) and \(u_{2}\) were generated by two independent normal distributions. But in section 9.1.3, we want to allow for the possibility that the adjustments have a non-zero correlation. We can use the Cholesky factorization to generate correlated random variables in the following way.

  1. Generate uncorrelated vectors, \(z_{u_1}\) and \(z_{u_2}\), for each vector of adjustments \(u_1\) and \(u_2\), as sampled from \(\mathit{Normal}(0,1)\):

\[z_{u_1} \sim \mathit{Normal}(0,1)\] \[z_{u_2} \sim \mathit{Normal}(0,1)\]

  2. By multiplying the Cholesky factor with the \(z\)’s, generate a matrix that contains two row vectors of correlated variables (with standard deviation \(1\)).

\[ \mathbf{L_u}\cdot \mathbf{z_u} = {\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}} {\begin{pmatrix} z_{u_{1,subj=1}} & z_{u_{1,subj=2}} & ... & z_{u_{1,subj=N_{subj}}} \\ z_{u_{2,subj=1}} & z_{u_{2,subj=2}} & ... & z_{u_{2,subj=N_{subj}}} \end{pmatrix}} \]

\[ = {\begin{pmatrix} L_{11} \cdot z_{u_{1,1}} + 0 \cdot z_{u_{2,1}} & ... & L_{11} \cdot z_{u_{1,N_{subj}}} + 0 \cdot z_{u_{2,N_{subj}}} \\ L_{21} \cdot z_{u_{1,1}} + L_{22} \cdot z_{u_{2,1}} & ... & L_{21} \cdot z_{u_{1,N_{subj}}} + L_{22} \cdot z_{u_{2,N_{subj}}} \end{pmatrix}} \]

A very informal explanation of why this works is that we are making the variable that corresponds to the slope adjustment a function of a scaled version of the variable that corresponds to the intercept adjustment.

  3. The last step is to scale the previous matrix to the desired standard deviations. We define the diagonal matrix \(diag\_matrix(\tau_u)\) as before:

\[ {\begin{pmatrix} \tau_{u_1} & 0 \\ 0 & \tau_{u_2} \end{pmatrix}} \]

Then pre-multiply the matrix of correlated variables (with standard deviation \(1\)) from before by this diagonal matrix:

\[\mathbf{u} = diag\_matrix(\tau_u) \cdot \mathbf{L_u}\cdot \mathbf{z_u} = \]

\[ {\begin{pmatrix} \tau_{u_1} & 0 \\ 0 & \tau_{u_2} \end{pmatrix}} {\begin{pmatrix} L_{11} \cdot z_{u_{1,1}} & ... \\ L_{21} \cdot z_{u_{1,1}} + L_{22} \cdot z_{u_{2,1}} & ... \end{pmatrix}} \]

\[ = {\begin{pmatrix} \tau_{u_1} \cdot L_{11} \cdot z_{u_{1,1}} & \tau_{u_1} \cdot L_{11} \cdot z_{u_{1,2}} & ... \\ \tau_{u_2} \cdot (L_{21} \cdot z_{u_{1,1}} + L_{22} \cdot z_{u_{2,1}}) & \tau_{u_2} \cdot (L_{21} \cdot z_{u_{1,2}} + L_{22} \cdot z_{u_{2,2}}) & ... \end{pmatrix}} \]

It might be helpful to see how one would implement this in R:

Let’s assume a correlation of \(0.8\).

rho_u <- 0.8
# Correlation matrix
(R_u <- matrix(c(1, rho_u, rho_u, 1), ncol = 2))
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0
# Cholesky factor:
# (Transpose it so that it looks the same as in Stan)
(L_u <- t(chol(R_u)))
##      [,1] [,2]
## [1,]  1.0  0.0
## [2,]  0.8  0.6
# Verify that we recover R_u,
# Recall that %*% indicates matrix multiplication
L_u %*% t(L_u)
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0
  1. Generate uncorrelated z from a standard normal distribution assuming only 10 subjects.
N_subj <- 10
(z_u1 <- rnorm(N_subj, 0, 1))
##  [1] 0.138 0.588 0.680 1.391 0.759 1.432 0.283 0.384 1.389 1.119
(z_u2 <- rnorm(N_subj, 0, 1))
##  [1]  0.826 -0.666 -0.213  0.936  0.876  1.105  1.109  0.832  0.862
## [10] -0.541
  2. Create the matrix of correlated parameters.

First, create a matrix with the uncorrelated parameters:

# matrix z_u
(z_u <- matrix(c(z_u1, z_u2), ncol = N_subj, byrow = TRUE))
##       [,1]   [,2]   [,3]  [,4]  [,5] [,6]  [,7]  [,8]  [,9]  [,10]
## [1,] 0.138  0.588  0.680 1.391 0.759 1.43 0.283 0.384 1.389  1.119
## [2,] 0.826 -0.666 -0.213 0.936 0.876 1.11 1.109 0.832 0.862 -0.541

Then, generate correlated parameters by pre-multiplying the \(\mathbf{z}_u\) matrix with \(\mathbf{L}_u\).

L_u %*% z_u
##       [,1]   [,2]  [,3] [,4]  [,5] [,6]  [,7]  [,8] [,9] [,10]
## [1,] 0.138 0.5884 0.680 1.39 0.759 1.43 0.283 0.384 1.39 1.119
## [2,] 0.606 0.0712 0.416 1.67 1.132 1.81 0.891 0.806 1.63 0.571
  3. Use the following diagonal matrix to scale the correlated variables.
tau_u1 <- 0.2
tau_u2 <- 0.01
(diag_matrix_tau <- diag(c(tau_u1,  tau_u2)))
##      [,1] [,2]
## [1,]  0.2 0.00
## [2,]  0.0 0.01
  4. Finally, generate the adjustments u for each subject:
(u <- diag_matrix_tau %*% L_u %*% z_u)
##         [,1]     [,2]    [,3]   [,4]   [,5]   [,6]    [,7]
## [1,] 0.02757 0.117677 0.13600 0.2782 0.1518 0.2864 0.05654
## [2,] 0.00606 0.000712 0.00416 0.0167 0.0113 0.0181 0.00891
##         [,8]   [,9]   [,10]
## [1,] 0.07674 0.2777 0.22378
## [2,] 0.00806 0.0163 0.00571
# The rows are correlated, approximately 0.8
cor(u[1, ], u[2, ])
## [1] 0.703
# The variance components can be recovered as well:
sd(u[1, ])
## [1] 0.0974
sd(u[2, ])
## [1] 0.00587

B.7 Different rank visualizations and the SBC package

Implementing the simulation-based calibration algorithm “by hand”, as we did in section 10.2, introduces a new source of potential errors. Fortunately, the R package SBC (Kim et al. 2024) provides tools to validate a Stan model (or any sampling algorithm) by allowing us to run simulation-based calibration easily. The package is in active development at the time of writing (even though it is already fully functional, function names and arguments might change by the time this book is published) and can be installed with the following command.

remotes::install_github("hyunjimoon/SBC")

One of the main advantages of this package is that it provides several ways to visualize the results of the simulation-based calibration procedure; see https://hyunjimoon.github.io/SBC/. Figure B.1 shows rank histograms produced by SBC of a correct model and several different incorrect models. An alternative to rank histograms is to use an empirical cumulative distribution function (ECDF)-based method, as proposed by Säilynoja, Bürkner, and Vehtari (2022). The idea behind this method is that if the ranks produced by the simulation-based calibration algorithm are uniform, the ECDF of the ranks should be close to the CDF of a uniform distribution. Figure B.2 shows the difference between the ECDF of the ranks and the CDF of a uniform distribution together with 95% confidence bands (this is the default in the SBC package) for a correct model and different incorrect ones.

FIGURE B.1: Rank histograms produced by the R package SBC showing the outcome one would expect for a correct model and for several different incorrect ones together with 95% confidence bands (these are the default bands in the package).

FIGURE B.2: Difference between the perfectly uniform CDF and empirical cumulative distribution function (ECDF) of the ranks produced by the SBC R package together with 95% confidence bands. The figure shows the outcome one would expect for a correct model and for several different incorrect ones.

References

Betancourt, Michael J., and Mark Girolami. 2015. “Hamiltonian Monte Carlo for Hierarchical Models.” Current Trends in Bayesian Methodology with Applications 79 (30): 2–4.

Gentle, James E. 2007. “Matrix Algebra: Theory, Computations, and Applications in Statistics.” Springer Texts in Statistics 10.

Kim, Shinyoung, Hyunji Moon, Martin Modrák, and Teemu Säilynoja. 2024. SBC: Simulation Based Calibration for Rstan/Cmdstanr Models. https://hyunjimoon.github.io/SBC/.

Säilynoja, Teemu, Paul-Christian Bürkner, and Aki Vehtari. 2022. “Graphical Test for Discrete Uniformity and Its Applications in Goodness-of-Fit Evaluation and Multiple Sample Comparison.” Statistics and Computing 32 (2): 1–21. https://doi.org/10.1007/s11222-022-10090-6.

Stan Development Team. 2024. “Stan Modeling Language Users Guide and Reference Manual, Version 2.32.” https://mc-stan.org/docs/2_35/.

