Chapter 8 Introduction to the probabilistic programming language Stan
Stan is a probabilistic programming language for statistical inference written in C++ that can be accessed through several interfaces (e.g., R, Python, etc.). Stan uses an advanced dynamic Hamiltonian Monte Carlo algorithm (Betancourt 2016) based on a variant of the No-U-Turn sampler (known as NUTS: Hoffman and Gelman 2014), which is, in general, more efficient than the traditional Gibbs sampler used in other probabilistic languages such as (Win)BUGS (Lunn et al. 2000) and JAGS (Plummer 2016). In this part of the book, we will focus on the package rstan
(Guo et al. 2024) that integrates Stan (Carpenter et al. 2017) with R (R Core Team 2023).
In order to understand how to fit a model in Stan and the difficulties one might face, a minimal understanding of the Stan sampling algorithm is needed. Our descriptions of the technical details of HMC are oversimplifications that suffice for the purposes of this book. Stan takes advantage of the fact that the shape of the posterior distribution is completely determined by the priors and the likelihood; in other words, the shape of the posterior is determined by the analytical form of the unnormalized posterior, which is the upper part of the Bayes rule (abbreviated below as \(q(\Theta | y)\)). This is because the denominator, or marginal likelihood, “only” constitutes the normalizing constant:
\[\begin{equation} p(\Theta|y) = \cfrac{ p(y|\Theta) \cdot p(\Theta) }{p(y)} \tag{8.1} \end{equation}\]
\[\begin{equation} q(\Theta|y) = p(y|\Theta) \cdot p(\Theta) \tag{8.2} \end{equation}\]
Thus, the unnormalized posterior is proportional (\(\propto\)) to the posterior distribution:
\[\begin{equation} q(\Theta|y) \propto p(\Theta|y) \tag{8.3} \end{equation}\]
The Stan sampler uses Hamiltonian dynamics and treats the vector of parameters, \(\Theta\) (that could range from a vector containing a couple of parameters, e.g., \(\langle \mu,\sigma \rangle\), to a vector of hundreds of parameters in hierarchical models), as the position of a frictionless particle that glides on the negative logarithm of the unnormalized posterior. That means that high probability places are valleys and low probability places are peaks in this space.32 Whenever the particle comes to a halt, that location constitutes a sample from the distribution.
However, Stan doesn’t just let the particle glide until it reaches the bottom of this space. If we let that happen, the particle would stop at the mode of the posterior distribution, and one would not be able to obtain samples from the rest of the distribution. Stan uses a complex algorithm to determine the weight of the particle and the momentum applied to it, as well as when to stop the particle trajectory to take a sample. Because we need to know the velocity of this particle, Stan needs to be able to calculate the derivative of the log unnormalized posterior with respect to the parameters (recall that velocity is the first derivative with respect to the position). This means that if the parameter space is differentiable and relatively smooth (i.e., does not have any big break or sharp angle), and if the tuning parameters of the Stan algorithm are well adjusted–as should happen in the warm-up period–these samples are going to represent samples of the true posterior distribution. Bear in mind that the geometry of the posterior has a big influence on whether the algorithm will converge (quickly) or not: If the space is very flat, because there isn’t much data and the priors are not informative, then the particle may need to glide for a long time before it gets to a high probability area, that is a valley; if there are several valleys (multimodality) the particle may never leave the vicinity of one of them; and if the space is funnel shaped, the particle may never explore the funnel. One of the reasons for the difficulties in exploring complicated spaces is that the continuous path of the “particle” is discretized and divided into steps, and the step size is optimized for the entire posterior space. In spaces that are too complex, such as a funnel, a step size might be too small to explore the wide part of the funnel, but too large to explore the narrow part; we will deal with this problem in section 9.1.2.
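To make the physical analogy concrete, the following sketch (ours, not Stan's actual implementation) simulates one trajectory of such a "particle" on the negative log density of a standard normal using the leapfrog integrator; it omits the Metropolis acceptance step and the adaptive trajectory length of NUTS, and the step size epsilon and number of steps L stand in for the tuning parameters mentioned above.
grad_U <- function(theta) theta # derivative of U(theta) = theta^2 / 2
leapfrog <- function(theta, p, epsilon = 0.1, L = 25) {
  for (l in 1:L) {
    p <- p - epsilon / 2 * grad_U(theta) # half step for the momentum
    theta <- theta + epsilon * p         # full step for the position
    p <- p - epsilon / 2 * grad_U(theta) # half step for the momentum
  }
  theta # the end point of the trajectory is the candidate sample
}
leapfrog(theta = 0, p = rnorm(1)) # start at theta = 0 with a random momentum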
Although our following example assumes a vector of two parameters and thus a simple geometry, real world examples can easily have hundreds of parameters defining an unnormalized posterior space with hundreds of dimensions.
One question that might arise here is the following: Given that we already know the shape of the posterior, why do we need samples? After all, the posterior is just the unnormalized posterior multiplied by some number, the normalizing constant.
To make this discussion concrete, let’s say that we have a subject that participates in a memory test, and in each trial we get a noisy score from their true working memory score. We assume that at each trial, the score is elicited with normally distributed noise. If we want to estimate the score and how much the noise makes it vary from trial to trial, we are assuming a normal likelihood and we want to estimate its mean and standard deviation.
We will use simulated data produced by a normal distribution with a true mean of \(3\) and a true standard deviation of \(10\) (the units are arbitrary):
y <- rnorm(n = 100, mean = 3, sd = 10)
head(y)
## [1] -4.41 -10.92 21.39 4.23 5.53 4.65
As always, given our prior knowledge, we decide on priors. In this case, we use a log-normal prior for the standard deviation, \(\sigma\), since it can only be positive, but except for that, the prior distributions are quite arbitrary in this example.
\[\begin{equation} \begin{aligned} \mu &\sim \mathit{Normal}(0, 20)\\ \sigma &\sim \mathit{LogNormal}(3, 1) \end{aligned} \tag{8.4} \end{equation}\]
The unnormalized posterior will be the product of the likelihood of each data point times the prior for each parameter:
\[\begin{equation} q(\mu, \sigma |y) = \prod_n^{100} \mathit{Normal}(y_n|\mu, \sigma) \cdot \mathit{Normal}(\mu | 0, 20) \cdot \mathit{LogNormal}(\sigma | 3, 1) \tag{8.5} \end{equation}\]
where \(y = \{-4.411, -10.918, \ldots\}\)
We can also define the unnormalized posterior, \(q(\cdot)\), as a function in R:
q <- function(mu, sigma, y) {
  dnorm(x = mu, mean = 0, sd = 20) *
    dlnorm(x = sigma, meanlog = 3, sdlog = 1) *
    prod(dnorm(x = y, mean = mu, sd = sigma))
}
For example, if we want to know the unnormalized posterior density for the vector of parameters \(\langle \mu,\sigma \rangle = \langle 0, 5 \rangle\), we do the following:
q(mu = 0, sigma = 5, y = y)
## [1] 1.07e-217
The shape of the unnormalized posterior density is completely defined and it will look like Figure 8.1.
FIGURE 8.1: The unnormalized posterior defined by Equation (8.5).
Why is the shape of the unnormalized posterior density not too useful for us? The main reason is that unless we already know which probability distribution we are dealing with (e.g., normal, Bernoulli, etc.) or we can easily integrate it (which can only be done in simpler cases), we cannot do much with the analytical form of the unnormalized posterior: We cannot calculate credible intervals, or know how likely it is that the true score is above or below zero, and even the mean of the posterior is impossible to calculate. This is because the unnormalized posterior distribution represents the shape of the posterior distribution. With just the shape of an unknown distribution, we can only answer the following question: What is the most (or least) likely value of the vector of parameters? We can answer this question by searching for the highest (or lowest) place in that shape. This leads us to the maximum a posteriori (MAP) estimate (i.e., the highest point in Figure 8.1, or the lowest point of the negative log unnormalized posterior), which is in a sense the “Bayesian counterpart” of the penalized maximum likelihood estimate (MLE); see Green (1998) and Scheipl, Kneib, and Fahrmeir (2013) for a discussion of penalized likelihood and its connection to Bayesian inference. However, the MAP is not truly Bayesian since it doesn’t take into account the uncertainty of the posterior. Even calculating the mode is not always trivial. In simple cases such as this one, one can calculate it analytically; but in more complex cases, relatively complicated algorithms are needed. (If we can recognize the shape as a distribution, we are in a different situation. In that case, we might already know the formulas for the expectation, variance, etc. This is what we did in chapter 2, but this is an unusual situation in realistic analyses.) As we mentioned before, if we want to get posterior density values, we need the denominator of the Bayes rule (or marginal likelihood), \(p(y)\), which requires integrating the unnormalized posterior. Even this is not too useful if we want to communicate findings: almost every summary statistic requires us to solve more integrals, and except for a handful of cases, these integrals might not have an analytical solution.
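To make the point about mode-finding concrete, here is a minimal sketch of numerical MAP estimation using R's general-purpose optimizer optim(); this is purely illustrative and is not what Stan does when it samples. We work on the log scale to avoid underflow and search over \(\log(\sigma)\) so that \(\sigma\) stays positive:
neg_log_q <- function(par, y) {
  mu <- par[1]
  sigma <- exp(par[2]) # search over log(sigma)
  -(dnorm(mu, mean = 0, sd = 20, log = TRUE) +
      dlnorm(sigma, meanlog = 3, sdlog = 1, log = TRUE) +
      sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)))
}
map <- optim(c(0, log(5)), neg_log_q, y = y)
c(mu = map$par[1], sigma = exp(map$par[2]))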
If we want to be able to calculate summary statistics of the posterior distribution (mean, quantiles, etc.), we are going to need samples from this distribution. This is because with enough samples of a probability distribution, we can achieve very good approximations of summary statistics.
Stan will take care of returning samples from the posterior distribution, if the log unnormalized posterior distribution is differentiable and can be expressed as follows:33
\[\begin{equation} \log(q(\Theta|y)) = \sum_n \log(p(y_n|\Theta)) + \sum_{par} \log(p(\Theta_{par})) \tag{8.6} \end{equation}\]
where \(n\) indicates each data point and \(par\) each parameter. In our case, this corresponds to the following:
\[\begin{equation} \begin{aligned} \log(q(\mu, \sigma |y)) =& \sum_n^{100} \log(\mathit{Normal}(y_n|\mu, \sigma)) + \log(\mathit{Normal}(\mu | 0, 20)) \\ &+ \log(\mathit{LogNormal}(\sigma | 3, 1)) \end{aligned} \tag{8.7} \end{equation}\]
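As a sanity check, Equation (8.7) can be implemented directly in R with the simulated y; this is the quantity that a Stan program accumulates in the variable target (introduced below), and it equals the log of q() defined earlier:
log_q <- function(mu, sigma, y) {
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)) +
    dnorm(mu, mean = 0, sd = 20, log = TRUE) +
    dlnorm(sigma, meanlog = 3, sdlog = 1, log = TRUE)
}
log_q(mu = 0, sigma = 5, y = y) # same value as log(q(mu = 0, sigma = 5, y = y))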
In the following sections, we’ll see how we can implement this model and many others in Stan.
8.1 Stan syntax
A Stan program is usually saved as a .stan
file and accessed through R (or other interfaces) and it is organized into a sequence of optional and obligatory blocks, which must be written in order. The Stan language is different from R and it is loosely based on C++; one important aspect to pay attention to is that every statement ends in a semi-colon, ;
. Blocks ({}
) do not end in semi-colons. Some functions in Stan are written in the same way as in R (e.g., mean
, sum
, max
, min
). But some are different; when in doubt, Stan documentation can be extremely helpful. In addition, the package rstan
provides the function lookup()
to look up translations of R functions. For example, in section 4.3, we saw that the R function plogis()
is needed to convert from log-odds to probability space. If we need it in a Stan program, we can look for it in the following way:
lookup(plogis)
## StanFunction Arguments
## 262 inv_logit (T);(int)
## 307 log_inv_logit (T);(int)
## 323 logistic_ccdf_log (real, real, T);(vector, vector, vector)
## 324 logistic_cdf (real, real, T);(vector, vector, vector)
## 325 logistic_cdf_log (real, real, T);(vector, vector, vector)
## 326 logistic_lccdf (real, real, T);(vector, vector, vector)
## 327 logistic_lcdf (real, real, T);(vector, vector, vector)
## ReturnType
## 262 T;real
## 307 T;real
## 323 T;real
## 324 T;real
## 325 T;real
## 326 T;real
## 327 T;real
There are three columns in the output of this call. The first one indicates Stan function names, the second one their arguments with their type, and the third one the type they return. Unlike R, Stan is strict with the type of the variables.34 In order to decide which function to use, it is necessary to look at the Stan documentation and find the function that matches our specific needs (for plogis
, the corresponding function would be inv_logit()
).
Another important difference from R is that every variable needs to be declared with its type (real, integer, vector, matrix, etc.). The next two sections exemplify these details through basic Stan programs.
8.2 A first simple example with Stan: Normal likelihood
Let’s fit a Stan model for the simple example given in the introduction of this chapter, where we simulate data in R from a normal distribution with a true mean of 3
and a true standard deviation of 10
:
y <- rnorm(n = 100, mean = 3, sd = 10)
As mentioned earlier, Stan code is organized in blocks. The first block indicates what constitutes data for the model:
data {
int<lower = 1> N; // Total number of trials
vector[N] y; // Score in each trial
}
The variable of type int
(integer) represents the number of trials. In addition to the type, some constraints can be indicated with lower
and upper
. In this case, N
can’t be smaller than 1
. These constraints serve as a sanity check; if they are not satisfied, we get an error and the model won’t run. The data are stored in a vector of length N
; unlike in R, vectors (and matrices and arrays) need to be defined with their dimensions. Comments are indicated with //
rather than #
.
The next block indicates the parameters of the model:
parameters {
real mu;
real<lower = 0> sigma;
}
The two parameters are real numbers, and sigma
is constrained to be positive.
Finally, we indicate the prior distributions and likelihood functions in the model block:
model {
// Priors:
target += normal_lpdf(mu | 0, 20);
target += lognormal_lpdf(sigma | 3, 1);
// Likelihood:
for(i in 1:N)
target += normal_lpdf(y[i] | mu, sigma);
}
The variable target
is a reserved word in Stan; every statement with target +=
adds terms to the unnormalized log posterior density. We do this because adding to the unnormalized log posterior amounts to multiplying a term in the numerator of the unnormalized posterior. As explained earlier, Stan uses the shape of the unnormalized posterior to sample from the actual posterior distribution. See online section B.1 for a more detailed explanation, and see online section B.2 for alternative notations.
We didn’t use curly brackets with the for-loop; this is a common practice if the for-loop has only one line, but brackets can be added and are obligatory if the for-loop spans several lines.
It’s also possible to avoid the for-loop since many functions are vectorized in Stan:
model {
// Priors:
target += normal_lpdf(mu | 0, 20);
target += lognormal_lpdf(sigma | 3, 1);
// Likelihood:
target += normal_lpdf(y | mu, sigma);
}
The for-loop and vectorized versions give us the same output: The for-loop version evaluates the log-likelihood at each value of y and adds it to target
. The vectorized version does not create a vector of log-likelihoods; instead, it sums up the log-likelihood evaluated at each element of y
and then it adds that to target
.
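The same equivalence can be illustrated in R with the simulated y (using the true values of the parameters just for the sake of the example):
sum(dnorm(y, mean = 3, sd = 10, log = TRUE)) # vectorized sum
loglik <- 0
for (i in seq_along(y))
  loglik <- loglik + dnorm(y[i], mean = 3, sd = 10, log = TRUE)
loglik # same value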
The complete model looks like this:
data {
int<lower = 1> N; // Total number of trials
vector[N] y; // Score in each trial
}
parameters {
real mu;
real<lower = 0> sigma;
}
model {
// Priors:
target += normal_lpdf(mu | 0, 20);
target += lognormal_lpdf(sigma | 3, 1);
// Likelihood:
target += normal_lpdf(y | mu, sigma);
}
You can save the above code as normal.stan
. Alternatively, you can use the version stored in the package bcogsci
. (Typing ?stan-normal
in the R console provides some documentation for the model.) You can access the code of the models of this book by using system.file("stan_models", "name_of_the_model.stan", package = "bcogsci")
.
system.file("stan_models",
normal <-"normal.stan",
package = "bcogsci")
This command just points to a text file that the package bcogsci
stores on your computer. You can open it to read the code (with any text editor, or readLines()
in R). You’ll need to compile this code and run it with stan()
.
Stan requires the data to be in a list object in R. Below, we fit the model with the default number of chains and iterations.
y <- rnorm(n = 100, mean = 3, sd = 10)
lst_score_data <- list(y = y, N = length(y))
# Fit the model with the default values of number of
# chains and iterations: chains = 4, iter = 2000
fit_score <- stan(normal, data = lst_score_data)
# alternatively:
# stan("normal.stan", data = lst_score_data)
Inspect how well the chains mixed in Figure 8.2. The chains for each parameter should look like a “fat hairy caterpillar” (Lunn et al. 2012); see section 3.2.1.2 for a brief discussion about convergence.
traceplot(fit_score, pars = c("mu", "sigma"))
FIGURE 8.2: Traceplots of mu
and sigma
from the model fit_score
.
We can see a summary of the posterior by either printing out the model fit, or by plotting it. The summary displayed by the function print
includes means, standard deviations (sd
), quantiles, Monte Carlo standard errors for the mean of the posterior (se_mean
), split Rhats, and effective sample sizes (n_eff
). The summaries are computed after removing the warmup and merging together all chains. The se_mean
is unrelated to the se
of an estimate in the parallel frequentist model. Similarly to a large effective sample size, small Monte Carlo standard errors indicate a successful sampling procedure: with a large value of n_eff
and a small value for se_mean
we can be relatively sure of the reliability of the mean of the posterior. However, what constitutes a large or small se_mean
is harder to define (see Vehtari et al. 2021 for a more extensive discussion).35
print(fit_score, pars = c("mu", "sigma"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## mu 3.91 0.02 0.92 2.06 3.29 3.91 4.53 5.7 3707 1
## sigma 9.23 0.01 0.66 8.07 8.75 9.19 9.65 10.7 4026 1
##
## Samples were drawn using NUTS(diag_e) at Mon Jan 13 08:27:33 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
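As a rough check of how these quantities relate, se_mean is approximately the posterior standard deviation divided by the square root of n_eff; this can be recomputed from the summary of fit_score (a sketch using rstan's summary() method):
fit_summary <- summary(fit_score, pars = c("mu", "sigma"))$summary
fit_summary[, "sd"] / sqrt(fit_summary[, "n_eff"]) # approximately se_mean
fit_summary[, "se_mean"]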
After transforming the stanfit
object into a data frame, it’s possible to provide summary plots as the one shown in Figure 8.3. The package bayesplot
(Gabry and Mahr 2024) is a wrapper around ggplot2
(Wickham, Chang, et al. 2024) and has several convenient functions to plot the samples. Bayesplot functions for posterior summaries start with mcmc_
:
df_fit_score <- as.data.frame(fit_score)
mcmc_hist(df_fit_score, pars = c("mu", "sigma"))
FIGURE 8.3: Histograms of the samples of the posterior distributions of mu
and sigma
from the model fit_score
.
There are also several ways to get the samples for other summaries or customized plots, depending on whether we want a list, a data frame, or an array.36
# The function extract from rstan is sometimes overwritten by
# a tidyverse version with the same name, so make sure that
# you are using the right function:
rstan::extract(fit_score) %>%
  str()
## List of 3
## $ mu : num [1:4000(1d)] 3.89 5.04 3.8 3.91 5.55 ...
## ..- attr(*, "dimnames")=List of 1
## .. ..$ iterations: NULL
## $ sigma: num [1:4000(1d)] 8.01 9.37 8.47 8.39 9.66 ...
## ..- attr(*, "dimnames")=List of 1
## .. ..$ iterations: NULL
## $ lp__ : num [1:4000(1d)] -370 -369 -368 -368 -369 ...
## ..- attr(*, "dimnames")=List of 1
## .. ..$ iterations: NULL
as.data.frame(fit_score) %>%
str(list.len = 5)
## 'data.frame': 4000 obs. of 3 variables:
## $ mu : num 4.92 2.6 4.91 3.07 4.02 ...
## $ sigma: num 10.04 8.51 9.97 9.39 9.04 ...
## $ lp__ : num -369 -369 -369 -368 -368 ...
as.array(fit_score) %>%
str()
## num [1:1000, 1:4, 1:3] 4.92 2.6 4.91 3.07 4.02 ...
## - attr(*, "dimnames")=List of 3
## ..$ iterations: NULL
## ..$ chains : chr [1:4] "chain:1" "chain:2" "chain:3" "chain:4"
## ..$ parameters: chr [1:3] "mu" "sigma" "lp__"
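Once the samples are in a data frame, the summaries that were out of reach with only the unnormalized posterior become one-liners; for example, a 95% credible interval for mu and the posterior probability that mu is larger than zero:
quantile(df_fit_score$mu, probs = c(0.025, 0.975))
mean(df_fit_score$mu > 0)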
8.3 Another simple example: Cloze probability with a binomial likelihood in Stan
Let’s fit a Stan model (binomial_cloze.stan
) to estimate the cloze probability of a word given its context, that is, the probability of an upcoming word given its previous context; this is the model that was detailed in section 2.2 and fit in section 3.1. We want to estimate \(\theta\), the cloze probability of producing “umbrella,” given the following data: “umbrella” was produced in \(80\) out of \(100\) trials. We assume a binomial distribution as the likelihood function, and \(\mathit{Beta}(a=4,b=4)\) as a prior distribution for the cloze probability.
data {
// Total number of answers
int<lower = 1> N;
// Number of times umbrella was answered:
int<lower = 0, upper = N> k;
}
parameters {
// theta is a probability, must be constrained between 0 and 1
real<lower = 0, upper = 1> theta;
}
model {
// Prior on theta:
target += beta_lpdf(theta | 4, 4);
// Likelihood:
target += binomial_lpmf(k | N, theta);
}
There is only one parameter in this model, the cloze probability, represented by the parameter theta, which is a real number constrained between \(0\) and \(1\). Another difference between this and the previous example is that the likelihood function ends with _lpmf rather than with _lpdf. This is because Stan differentiates between distributions of continuous variables, i.e., probability density functions (PDFs), and distributions of discrete variables, i.e., probability mass functions (PMFs).
lst_cloze_data <- list(k = 80, N = 100)
binomial_cloze <- system.file("stan_models",
                              "binomial_cloze.stan",
                              package = "bcogsci")
fit_cloze <- stan(binomial_cloze, data = lst_cloze_data)
Print the summary of the posterior distribution of \(\theta\) below, and show its posterior distribution graphically (see Figure 8.4):
print(fit_cloze, pars = c("theta"))
## mean 2.5% 97.5% n_eff Rhat
## theta 0.78 0.7 0.85 1371 1
df_fit_cloze <- as.data.frame(fit_cloze)
mcmc_dens(df_fit_cloze, pars = "theta") +
  geom_vline(xintercept = mean(df_fit_cloze$theta))
FIGURE 8.4: The posterior distribution of the cloze probability of umbrella (the parameter \(\theta\)).
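Because the beta prior is conjugate to the binomial likelihood (see chapter 2), this simple example has an analytic posterior, \(\mathit{Beta}(4 + 80, 4 + 20)\), which provides a useful check of the samples:
mean(df_fit_cloze$theta)     # posterior mean from the samples
(4 + 80) / (4 + 80 + 4 + 20) # analytic posterior mean, approx. 0.78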
8.4 Regression models in Stan
In the following sections, we will revisit and expand on some of the examples that were fit with brms
in chapter 4.
8.4.1 A first linear regression in Stan: Does attentional load affect pupil size?
As in section 4.1, we focus on the effect of cognitive load on one subject’s pupil size with a subset of the data from Wahn et al. (2016). We use the following likelihood and priors. For details about our decision on priors and likelihood, see section 4.1.
\[\begin{equation} \begin{aligned} p\_size_n &\sim \mathit{Normal}(\alpha + c\_load_n \cdot \beta,\sigma) \\ \alpha &\sim \mathit{Normal}(1000, 500) \\ \beta &\sim \mathit{Normal}(0, 100) \\ \sigma &\sim \mathit{Normal}_+(0, 1000) \end{aligned} \end{equation}\]
The Stan model pupil.stan
is as follows:
data {
int<lower=1> N;
vector[N] p_size;
vector[N] c_load;
}
parameters {
real alpha;
real beta;
real<lower = 0> sigma;
}
model {
// priors:
target += normal_lpdf(alpha | 1000, 500);
target += normal_lpdf(beta | 0, 100);
target += normal_lpdf(sigma | 0, 1000)
- normal_lccdf(0 | 0, 1000);
// likelihood
target += normal_lpdf(p_size | alpha + c_load * beta, sigma);
}
Because we are fitting a regression, we use the location (\(\mu\)) of the likelihood function to regress p_size
with the following equation alpha + c_load * beta
, where both p_size
and c_load
are vectors defined in the data block. The following line accumulates the log-likelihood of every observation:
target += normal_lpdf(p_size | alpha + c_load * beta, sigma);
This is equivalent to and slightly faster than the following lines:
for(n in 1:N)
target += normal_lpdf(p_size[n] | alpha + c_load[n] * beta, sigma);
A statement that requires some explanation is the following:
target += normal_lpdf(sigma | 0, 1000)
- normal_lccdf(0 | 0, 1000);
As in our original example in section 4.1, we are assuming a truncated normal distribution as a prior for \(\sigma\). Not only are we setting a lower boundary to the parameter with lower = 0
, but we are also “correcting” its prior distribution by subtracting normal_lccdf(0 | 0, 1000)
, where lccdf
stands for the log complement of a cumulative distribution function. Once we add a lower boundary, the probability mass under half of the “regular” normal distribution should be one, that is, when we integrate from zero (rather than from minus infinity) to infinity. As discussed in online section A.2, we need to normalize the PDF by dividing it by the difference of its CDF evaluated at the new boundaries (\(a = 0\) and \(b = \infty\) in our case):
\[\begin{equation} f_{[a,b]}(x) = \frac{f(x)}{F(b) - F(a)} \tag{8.8} \end{equation}\]
where \(f\) is the PDF and \(F\) the CDF.
This equation in log-space is:
\[\begin{equation} \log(f_{[a,b]}(x)) = \log(f(x)) - \log(F(b) - F(a)) \tag{8.9} \end{equation}\]
In Stan, \(\log(f(x))\) corresponds to normal_lpdf(x |...), and \(\log(F(x))\) to normal_lcdf(x|...). Because in our example \(b=\infty\) and thus \(F(b) = 1\), we are dealing with the complement of the log CDF evaluated at \(a = 0\), \(\log(1 - F(0))\); that is why we use normal_lccdf(0 | ...) (notice the double c; this symbol represents the complement of the CDF).
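The same correction can be sketched in R: for, say, \(\sigma = 500\), the log density of a \(\mathit{Normal}(0, 1000)\) prior truncated at zero is the normal log density minus the log of the complement of its CDF at zero:
dnorm(500, mean = 0, sd = 1000, log = TRUE) -
  pnorm(0, mean = 0, sd = 1000, lower.tail = FALSE, log.p = TRUE)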
To be able to fit the model, Stan requires the data to be input as a list: First, load the data and center the dependent variable in a data frame, then create a list, and finally fit the model.
df_pupil <- df_pupil %>%
  mutate(c_load = load - mean(load))
ls_pupil <- list(p_size = df_pupil$p_size,
                 c_load = df_pupil$c_load,
                 N = nrow(df_pupil))
pupil <- system.file("stan_models",
                     "pupil.stan",
                     package = "bcogsci")
fit_pupil <- stan(pupil, data = ls_pupil)
Check the traceplots (Figure 8.5).
traceplot(fit_pupil, pars = c("alpha", "beta", "sigma"))
FIGURE 8.5: Traceplots of alpha
, beta
, and sigma
from the model fit_pupil
.
Examine some summaries of the marginal posterior distributions of the parameters of interest:
print(fit_pupil, pars = c("alpha", "beta", "sigma"))
## mean 2.5% 97.5% n_eff Rhat
## alpha 701.6 661.5 743.1 3299 1
## beta 33.6 10.4 56.8 3207 1
## sigma 128.8 103.4 161.7 2908 1
Plot the posterior distributions (Figure 8.6).
df_fit_pupil <- as.data.frame(fit_pupil)
mcmc_hist(fit_pupil, pars = c("alpha", "beta", "sigma"))
FIGURE 8.6: Histograms of the posterior samples of alpha
, beta
, and sigma
from the model fit_pupil
.
To determine the probability that the posterior for beta is larger than zero given the model and data, examine the proportion of samples above zero:
# We are using df_fit_pupil and not the "raw" Stanfit object.
mean(df_fit_pupil$beta > 0)
## [1] 0.997
To generate prior or posterior predictive distributions, we can create our own functions in R with the purrr
function map_dfr()
(or a for-loop) as we did in section 4.2 with the function lognormal_model_pred()
. Alternatively, we can use the generated quantities
block in our model:
data {
int<lower = 1> N;
vector[N] c_load;
int<lower= 0, upper = 1> onlyprior;
vector[N] p_size;
}
parameters {
real alpha;
real beta;
real<lower = 0> sigma;
}
model {
// priors including all constants
target += normal_lpdf(alpha | 1000, 500);
target += normal_lpdf(beta | 0, 100);
target += normal_lpdf(sigma | 0, 1000)
- normal_lccdf(0 | 0, 1000);
if (!onlyprior)
target += normal_lpdf(p_size | alpha + c_load * beta, sigma);
}
generated quantities {
array[N] real p_size_pred;
p_size_pred = normal_rng(alpha + c_load * beta, sigma);
}
For most of the probability functions, there is a matching pseudorandom number generator (PRNG) with the suffix _rng
. Here we are using the vectorized function normal_rng
. Once p_size_pred
is declared as an array of size N
, the following statement generates \(N\) predictions (for each iteration of the sampler):
p_size_pred = normal_rng(alpha + c_load * beta, sigma);
At the moment, not all the PRNGs are vectorized, and the ones that are only allow for arrays and, confusingly enough, not vectors. We define an array by writing array, then, between brackets, the length of each dimension, then the type, and finally the name of the variable. For example, to define an array of real numbers with three dimensions of length 6, 7, and 10, we write array[6, 7, 10] real var.37 Vectors and matrices are also valid types for an array. See online section B.4 for more about the difference between arrays and vectors, and other algebra types. We also include a data variable called onlyprior; this is an integer that can only be set to \(1\) (TRUE) or \(0\) (FALSE). When onlyprior == 1, the likelihood is omitted from the model, p_size is ignored, and p_size_pred is the prior predictive distribution. When onlyprior == 0, the likelihood is incorporated in the model (as it is in the original code pupil.stan) using p_size, and p_size_pred is the posterior predictive distribution.
If we want posterior predictive distributions, we fit the model to the data and set onlyprior = 0
; if we want prior predictive distributions, we sample from the priors and set onlyprior = 1
. Then we use bayesplot
functions to visualize predictive checks.
For posterior predictive checks, we would do the following:
ls_pupil <- list(onlyprior = 0,
                 p_size = df_pupil$p_size,
                 c_load = df_pupil$c_load,
                 N = nrow(df_pupil))
pupil_gen <- system.file("stan_models",
                         "pupil_gen.stan",
                         package = "bcogsci")
fit_pupil <- stan(file = pupil_gen, data = ls_pupil)
Store the predicted pupil sizes in yrep_pupil
. This variable contains an \(N_{samples} \times N_{observations}\) matrix, that is, each row of the matrix is a draw from the posterior predictive distribution, i.e., a vector with one element for each of the data points in y.
yrep_pupil <- extract(fit_pupil)$p_size_pred
dim(yrep_pupil)
## [1] 4000 41
Predictive check functions in bayesplot
(starting with ppc_
) require a vector with the observations in the first argument and a matrix with the predictive distribution as its second argument. As an example, in Figure 8.7 we use an overlay of densities and we draw only \(50\) elements (that is \(50\) predicted data sets).
ppc_dens_overlay(df_pupil$p_size, yrep = yrep_pupil[1:50, ])
FIGURE 8.7: A posterior predictive check showing 50 predicted density plots from the model fit_pupil
against the observed data.
For prior predictive distributions, we simply set onlyprior = 1
. The observations (p_size
) are ignored by the model, but are required by the data block in Stan. If we haven’t collected data yet, we could include a vector of zeros.
ls_pupil_prior <- list(onlyprior = 1,
                       p_size = df_pupil$p_size,
                       # or: p_size = rep(0, nrow(df_pupil)),
                       c_load = df_pupil$c_load,
                       N = nrow(df_pupil))
prior_pupil <- stan(pupil_gen, data = ls_pupil_prior,
                    control = list(adapt_delta = 0.99))
To avoid divergent transitions, increase the adapt_delta
parameter’s default value from \(0.8\) to \(0.99\). It is important to highlight that we cannot safely ignore the warnings of the above model, even if we are not fitting data. This is because in practice one is still sampling a density using Hamiltonian Monte Carlo, and thus the prior sampling process can break in the same ways as the posterior sampling process. Prior predictive distributions such as the one shown in Figure 8.8 can be plotted with ppd_dens_overlay() (and in general with functions starting with ppd_, which don't require an argument with data).
yrep_prior_pupil <- extract(prior_pupil)$p_size_pred
ppd_dens_overlay(yrep_prior_pupil[1:50, ])
FIGURE 8.8: Prior predictive distribution showing \(50\) predicted density plots from the model fit_pupil
.
8.4.2 Interactions in Stan: Does attentional load interact with trial number in affecting pupil size?
We’ll expand the previous model to also include the effect of (centered) trial and its interaction with cognitive load on one subject’s pupil size. Our new likelihood is as follows:
\[\begin{equation} p\_size_n \sim \mathit{Normal}(\alpha + c\_load_n \cdot \beta_1 + c\_trial_n \cdot \beta_2 + c\_load_n \cdot c\_trial_n \cdot \beta_3, \sigma) \end{equation}\]
Define priors for all the new \(\beta\)s. Since we don’t have more information about the new predictors, they are sampled from identical prior distributions:
\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(1000, 500) \\ \beta_1 &\sim \mathit{Normal}(0, 100) \\ \beta_2 &\sim \mathit{Normal}(0, 100) \\ \beta_3 &\sim \mathit{Normal}(0, 100) \\ \sigma &\sim \mathit{Normal}_+(0, 1000) \end{aligned} \end{equation}\]
The following Stan model, pupil_int1.stan
, is the direct translation of the new priors and likelihood.
data {
int<lower = 1> N;
vector[N] c_load;
vector[N] c_trial;
vector[N] p_size;
}
parameters {
real alpha;
real beta1;
real beta2;
real beta3;
real<lower = 0> sigma;
}
model {
// priors including all constants
target += normal_lpdf(alpha | 1000, 500);
target += normal_lpdf(beta1 | 0, 100);
target += normal_lpdf(beta2 | 0, 100);
target += normal_lpdf(beta3 | 0, 100);
target += normal_lpdf(sigma | 0, 1000)
- normal_lccdf(0 | 0, 1000);
target += normal_lpdf(p_size | alpha + c_load * beta1 +
c_trial * beta2 +
c_load .* c_trial * beta3, sigma);
}
When there are matrices or vectors involved, *
indicates matrix multiplication whereas .*
indicates element-wise multiplication; in R, %*%
indicates matrix multiplication whereas *
indicates element-wise multiplication.
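The distinction can be illustrated with a toy example in R:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
a * b      # element-wise, as .* in Stan
t(a) %*% b # matrix (inner) product, as * in Stan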
There is, however, an alternative notation that can simplify our code. In the following likelihood, \(p\_size\) is a vector of N observations (in this case 41), \(X\) is the model matrix with a dimension of \(N \times N_{pred}\) (in this case \(41 \times 3\)), and \(\beta\) a vector of \(N_{pred}\) (in this case, 3) rows. Assuming that \(\beta\) is a vector, we indicate with one line that each parameter \(\beta_n\) is sampled from identical prior distributions.
\[\begin{equation} \begin{aligned} p\_size &\sim \mathit{Normal}(\alpha + X \cdot \beta,\sigma)\\ \beta &\sim \mathit{Normal}(0, 100) \\ \sigma &\sim \mathit{Normal}_+(0, 1000) \end{aligned} \end{equation}\]
The translation into Stan code is the following:
data {
int<lower = 1> N;
int<lower = 0> K; // number of predictors
matrix[N, K] X; // model matrix
vector[N] p_size;
}
parameters {
real alpha;
vector[K] beta;
real<lower = 0> sigma;
}
model {
// priors including all constants
target += normal_lpdf(alpha | 1000, 500);
target += normal_lpdf(beta | 0, 100);
target += normal_lpdf(sigma | 0, 1000)
- normal_lccdf(0 | 0, 1000);
target += normal_lpdf(p_size | alpha + X * beta, sigma);
}
For some likelihood functions, Stan provides a more efficient implementation of the linear regression than the one manually written in the previous code. It's critical to understand that, in general, a more efficient implementation should not only be faster, but should also achieve at least the same number of effective samples as a less efficient implementation (and should also show convergence). In this case, we can achieve that using _glm
functions. We can replace the last line with the following statement (the order of the arguments is important):
target += normal_id_glm_lpdf(p_size | X, alpha, beta, sigma);
The most optimized model, pupil_int.stan, includes this last statement. We prepare the data as follows: First create centered versions of trial and load, c_trial and c_load; then use the function model.matrix to create the X matrix, which contains our predictors in its columns and omits the intercept with 0 +.
df_pupil <- df_pupil %>%
  mutate(c_trial = trial - mean(trial),
         c_load = load - mean(load))
X <- model.matrix(~ 0 + c_load * c_trial, df_pupil)
ls_pupil_X <- list(p_size = df_pupil$p_size,
                   X = X,
                   K = ncol(X),
                   N = nrow(df_pupil))
pupil_int <- system.file("stan_models",
                         "pupil_int.stan",
                         package = "bcogsci")
fit_pupil_int <- stan(pupil_int, data = ls_pupil_X)
print(fit_pupil_int, pars = c("alpha", "beta", "sigma"))
## mean 2.5% 97.5% n_eff Rhat
## alpha 699.63 668.03 731.54 4941 1
## beta[1] 31.26 11.88 49.85 4605 1
## beta[2] -5.79 -8.55 -2.87 5410 1
## beta[3] -1.82 -3.51 -0.20 4329 1
## sigma 104.48 82.88 133.55 3732 1
In Figure 8.9, we plot the 95% CrI of the parameters of interest. We use regex_pars
, rather than pars
, because we want to capture beta[1]
, beta[2]
, and beta[3]
; regex_pars
uses regular expressions to select the parameters (for information about regular expressions in R, see https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html).
df_fit_pupil_int <- as.data.frame(fit_pupil_int)
mcmc_intervals(fit_pupil_int,
               regex_pars = "beta",
               prob_outer = .95,
               prob = .8,
               point_est = "mean")
FIGURE 8.9: The means and 95% CrIs of the effect of load, beta[1]
, the effect of trial, beta[2]
, and their interaction, beta[3]
.
8.4.3 Logistic regression in Stan: Do set size and trial affect free recall?
We revisit and expand on the analysis presented in section 4.3 of a subset of the data from Oberauer (2019). In this example, we will investigate whether the length of a list and trial number affect the probability of correctly recalling a word.
As in section 4.3, we assume a Bernoulli likelihood with a logit link function, and the following priors (recall that the logistic function is the inverse of the logit).
\[\begin{equation} \begin{aligned} correct_n &\sim \mathit{Bernoulli}( \mathit{logistic}(\alpha + X \cdot \beta))\\ \alpha &\sim \mathit{Normal}(0, 1.5) \\ \beta &\sim \mathit{Normal}(0, 0.1) \end{aligned} \end{equation}\]
where \(\beta\) is a vector of size \(K = 3\), \(\langle \beta_1, \beta_2, \beta_3 \rangle\), containing the effects of set size, trial, and their interaction. Below, in recall.stan, we present the most efficient way to code this in Stan.
data {
int<lower = 1> N;
int<lower=0> K; // number of predictors
matrix[N, K] X; // model matrix
array[N] int correct;
}
parameters {
real alpha;
vector[K] beta;
}
model {
// priors including all constants
target += normal_lpdf(alpha | 0, 1.5);
target += normal_lpdf(beta | 0, .1);
target += bernoulli_logit_glm_lpmf(correct | X, alpha, beta);
}
The dependent variable, correct
, is an array of integers rather than a vector; this is because vectors are always composed of real numbers, but the Bernoulli likelihood only accepts the integers \(1\) or \(0\). As in the previous example, we are taking advantage of the _glm
functions. A less efficient but more transparent option would be to replace the last statement with:
target += bernoulli_logit_lpmf(correct | alpha + X * beta);
We might want to use bernoulli_logit_lpmf()
if we want to define a non-linear relationship between the predictors that falls outside the generalized linear model framework. One example would be the following:
target += bernoulli_logit_lpmf(correct| alpha + exp(X * beta));
Another more flexible possibility when we want to indicate a Bernoulli likelihood is to use bernoulli_lpmf()
and add the link manually. The last statement of recall.stan
would become the following:
target += bernoulli_lpmf(correct| inv_logit(alpha + X * beta));
The function bernoulli_lpmf()
can be useful if one wants to try other link functions; see online exercise G.8.4.
Finally, the most transparent form (but less efficient) would be the following for-loop:
for (n in 1:N)
target += bernoulli_lpmf(correct[n] | inv_logit(alpha + X[n] * beta));
To fit the model in recall.stan, first prepare the data by centering set size and trial number:
df_recall <- df_recall %>%
  mutate(c_set_size = set_size - mean(set_size),
         c_trial = trial - mean(trial))
Next, create the model matrix, \(X\), and prepare the data as a list. As in section 8.4.2, exclude the intercept from the matrix X
using 0 +...
. This is because in the Stan code that we are using, the intercept is kept separate from the model matrix \(X\), which only has the contrast coding for the slope parameters. The approach we take here is equivalent to defining a model matrix whose first column is a vector of ones.
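This equivalence can be checked directly with model.matrix() and the centered predictors created above; the only difference between the two calls is the leading column of ones:
head(model.matrix(~ 1 + c_set_size * c_trial, df_recall), 3)
head(model.matrix(~ 0 + c_set_size * c_trial, df_recall), 3)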
X <- model.matrix(~ 0 + c_set_size * c_trial, df_recall)
ls_recall <- list(correct = df_recall$correct,
                  X = X,
                  K = ncol(X),
                  N = nrow(df_recall))
recall <- system.file("stan_models",
                      "recall.stan",
                      package = "bcogsci")
fit_recall <- stan(recall, data = ls_recall)
After fitting the model, we can print and plot summaries of the posterior distribution.
print(fit_recall, pars = c("alpha", "beta"))
## mean 2.5% 97.5% n_eff Rhat
## alpha 1.99 1.40 2.62 3289 1
## beta[1] -0.19 -0.35 -0.02 4021 1
## beta[2] -0.02 -0.09 0.05 3122 1
## beta[3] 0.00 -0.03 0.04 3533 1
In Figure 8.10(a), we plot the 95% CrI of the parameters of interest.
df_fit_recall <- as.data.frame(fit_recall)
mcmc_intervals(df_fit_recall,
               regex_pars = "beta",
               prob_outer = .95,
               prob = .8,
               point_est = "mean") +
  xlab("log-odds")
As shown in section 4.3.4, we might want to communicate the posterior in proportions rather than in log-odds (the scale of the parameters beta). This can be done in R by manipulating the data frame df_fit_recall, or by extracting the samples with extract(fit_recall). Another alternative, presented here, is to use the generated quantities block. To make the code more compact, we declare the type of each variable and assign its content on the same line in recall_prop.stan.
generated quantities {
real average_accuracy = inv_logit(alpha);
vector[K] change_acc = inv_logit(alpha) - inv_logit(alpha - beta);
}
Recall that due to the non-linearity of the scale, the effects depend on the average accuracy, and the set size and trial that we start from (in this case we are examining the change of one unit from the average set size and the average trial).
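The same kind of quantity can also be computed in R from the samples of the earlier fit; for example, the change in accuracy associated with the set size effect (a sketch using df_fit_recall as defined above):
mean(plogis(df_fit_recall$alpha) -
       plogis(df_fit_recall$alpha - df_fit_recall$`beta[1]`))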
system.file("stan_models",
recall_prop <-"recall_prop.stan",
package = "bcogsci")
stan(recall_prop, data = ls_recall) fit_recall <-
The plot in Figure 8.10(b) now shows how the average accuracy deteriorates when the subject is exposed to a set size larger than the average by one, a trial further than the middle one, and the interaction of both.
df_fit_recall <- as.data.frame(fit_recall) %>%
  rename(set_size = `change_acc[1]`,
         trial = `change_acc[2]`,
         interaction = `change_acc[3]`)
mcmc_intervals(df_fit_recall,
pars = c("set_size", "trial", "interaction"),
prob_outer = .95,
prob = .8,
point_est = "mean") +
xlab("Change in accuracy")
FIGURE 8.10: (a) 95% credible intervals for the beta
parameters of the fit_recall
model, representing the log-odds effects of set size, trial, and their interaction. (b) Impact of set size, trial, and their interaction on average recall accuracy.
The plot in Figure 8.10(b) shows that our model is estimating that by increasing the set size by one unit, the recall accuracy of the single subject deteriorates by 2%. In contrast, there is hardly any trial effect or interaction between trial and set size.
8.5 Summary
This chapter introduced basic Stan syntax for fitting some standard linear models. Example code covered the normal, binomial, Bernoulli, and log-normal likelihoods. We also saw how to express regression models using the model matrix in Stan syntax.
8.6 Further reading
For further reading on the Hamiltonian Monte Carlo algorithm, see the rather technical review of Betancourt (2017), or the more conceptual introduction provided by Monnahan, Thorson, and Branch (2017) or chapter 17 of Lambert (2018). A useful article with example R code is Neal (2011). A detailed walk-through on its implementation is also provided in Chapter 41 of MacKay (2003). The Stan documentation (Stan Development Team 2024), consisting of a User's Guide and the Language Reference Manual, is an important starting point for going deeper into Stan programming: https://mc-stan.org/users/documentation/.
References
Betancourt, Michael J. 2016. “Identifying the Optimal Integration Time in Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1601.00225. http://arxiv.org/abs/1601.00225.
Betancourt, Michael J. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1701.02434. http://arxiv.org/abs/1701.02434.
Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael J. Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1).
Gabry, Jonah, and Tristan Mahr. 2024. bayesplot: Plotting for Bayesian Models. https://mc-stan.org/bayesplot/.
Green, Peter J. 1998. “Penalized Likelihood.” In Encyclopedia of Statistical Sciences, 2:578–86. Wiley.
Guo, Jiqiang, Jonah Gabry, Ben Goodrich, Andrew Johnson, Sebastian Weber, and Hamada S. Badr. 2024. rstan: R Interface to Stan. https://mc-stan.org/rstan/.
Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–1623. http://dl.acm.org/citation.cfm?id=2627435.2638586.
Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. London, UK: Sage.
Lunn, David J., Chris Jackson, David J. Spiegelhalter, Nichola G. Best, and Andrew Thomas. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.
Lunn, David J., Andrew Thomas, Nichola G. Best, and David J. Spiegelhalter. 2000. “WinBUGS-A Bayesian Modelling Framework: Concepts, Structure, and Extensibility.” Statistics and Computing 10 (4): 325–37.
MacKay, David J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge, UK: Cambridge University Press.
Monnahan, Cole C., James T. Thorson, and Trevor A. Branch. 2017. “Faster Estimation of Bayesian Models in Ecology Using Hamiltonian Monte Carlo.” Edited by Robert B. O’Hara. Methods in Ecology and Evolution 8 (3): 339–48. https://doi.org/10.1111/2041-210X.12681.
Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Taylor & Francis. https://doi.org/10.1201/b10905-10.
Oberauer, Klaus. 2019. “Working Memory Capacity Limits Memory for Bindings.” Journal of Cognition 2 (1): 40. https://doi.org/10.5334/joc.86.
Plummer, Martyn. 2016. “JAGS Version 4.2.0 User Manual.”
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Scheipl, Fabian, Thomas Kneib, and Ludwig Fahrmeir. 2013. “Penalized Likelihood and Bayesian Function Selection in Regression Models.” AStA Advances in Statistical Analysis 97 (4): 349–85. https://doi.org/10.1007/s10182-013-0211-3.
Stan Development Team. 2024. “Stan Modeling Language Users Guide and Reference Manual, Version 2.32.” https://mc-stan.org/docs/2_35/.
Vehtari, Aki, Andrew Gelman, Daniel P. Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2021. “Rank-Normalization, Folding, and Localization: An Improved \(\widehat{R}\) for Assessing Convergence of MCMC.” Bayesian Analysis 16 (2): 667–718. https://doi.org/10.1214/20-BA1221.
Wahn, Basil, Daniel P. Ferris, W. David Hairston, and Peter König. 2016. “Pupil Sizes Scale with Attentional Load and Task Experience in a Multiple Object Tracking Task.” PLOS ONE 11 (12): e0168087. https://doi.org/10.1371/journal.pone.0168087.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus O. Wilke, Kara Woo, Hiroaki Yutani, and Teun van den Brand. 2024. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.
In the specific case of a model with two parameters, e.g., \(\langle \mu,\sigma \rangle\), the physical analogy works quite well: The \(\langle x,y \rangle\) coordinates of the particle would be determined by \(\langle \mu,\sigma \rangle\), and its \(z\) coordinate would be established by the height of the negative logarithm of the unnormalized posterior.↩︎
Incidentally, this \(\log(q(\cdot))\) is the variable target in a Stan model and lp__ in its output; see the aside below.↩︎
In this kind of output, there are some types that are new to the R user (but they are also used in C++): reals indicates any of real, real[], vector, or row_vector. A return type R with an input type T indicates that the type of the output of the function is the same as the type of the argument.↩︎
We simplify the output of print in the text after this call, by actually calling summary(fit, pars = pars, probs = c(0.025, 0.975))$summary.↩︎
In addition, the new package posterior provides useful tools for working with output from Bayesian models. It allows for converting the output between many different formats, and offers consistent methods for operations commonly performed on samples, as well as summary functions.↩︎
The notation for arrays has changed in recent versions; the previous notation would have been real var[6, 7, 10].↩︎