Chapter 11 Hierarchical models and reparameterization

Now that we know how to fit simple regression models using Stan syntax, we can turn to more complex cases, such as the hierarchical models that we fit in brms in chapter 5. Fitting such models in Stan allows us a great deal of flexibility. However, a price to be paid when using Stan is that we need to think about how exactly we code the model. In some cases, two pieces of computer code that are mathematically similar might behave very differently due to the computer’s limitations; in this chapter, we will learn some of the more common techniques needed to optimize the model’s behavior. In particular, we will learn how to deal with convergence problems using what is called the non-centered reparameterization.

11.1 Hierarchical models with Stan

In the following sections, we will revisit and expand on some of the examples from chapter 5.

11.1.1 Varying intercept model with Stan

Recall that in section 5.2, we fit models to investigate the effect of cloze probability on EEG averages in the N400 spatiotemporal time window. For our first model, we’ll make the (implausible) assumption that only the average signal varies across subjects, but all subjects share the same effect of cloze probability. This means that the likelihood incorporates the assumption that the intercept, \(\alpha\), is adjusted with the term \(u_i\) for each subject.

\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n]} + c\_cloze_n \cdot \beta,\sigma) \end{equation}\]

\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(0,10)\\ \beta &\sim \mathit{Normal}(0,10)\\ u &\sim \mathit{Normal}(0,\tau_u)\\ \tau_{u} &\sim \mathit{Normal}_+(0,20) \\ \sigma &\sim \mathit{Normal}_+(0,50) \end{aligned} \end{equation}\]

Here \(n\) represents each observation, the \(n\)th row in the data frame and \(subj[n]\) is the subject that corresponds to observation \(n\). We present the mathematical notation of the likelihood with “multiple indexing” (see Stan Development Team 2024, Chapter 19 of the User’s guide): the index of \(u\) is provided by the vector \(subj\).

Before we discuss the Stan implementation, let’s see what the vector \(\mu\), the location of the normal likelihood, looks like. There are 2863 observations; that means that \(\boldsymbol{\mu}=\langle \mu_1,\mu_2, \ldots, \mu_{2863}\rangle\). We have 37 subjects which means that \(\boldsymbol{u}=\langle u_1,u_2, \ldots, u_{37} \rangle\). The following equation shows that the use of multiple indexing allows us to have a vector of adjustments with only 37 different elements, with a total length of 2863. In the equation below, the multiplication operator \(\circ\) is the Hadamard or elementwise product (Fieller 2016): when we write \(X\circ B\), both \(X\) and \(B\) have the same dimensions \(m\times n\), and each cell in location \([i,j]\) (where \(i=1,\dots,m\), and \(j=1,\dots,n\)) in \(X\) and \(B\) are multiplied to give a matrix that also has dimensions \(m\times n\).43

\[\begin{equation} \begin{aligned} \boldsymbol{\mu} &= \begin{bmatrix} \mu_1 \\ \mu_2 \\ \ldots \\ \mu_{101} \\ \mu_{102} \\ \ldots \\ \mu_{215} \\ \mu_{216} \\ \mu_{217} \\ \ldots \\ \mu_{1000} \\ \ldots \\ \mu_{2863} \end{bmatrix} = \begin{bmatrix} \alpha \\ \alpha \\ \ldots \\ \alpha \\ \alpha \\ \ldots \\ \alpha \\ \alpha \\ \alpha \\ \ldots \\ \alpha \\ \ldots \\ \alpha \end{bmatrix} + \begin{bmatrix} u_{subj[1]} \\ u_{subj[2]} \\ \ldots \\ u_{subj[101]} \\ u_{subj[102]} \\ \ldots \\ u_{subj[215]} \\ u_{subj[216]} \\ u_{subj[217]} \\ \ldots \\ u_{subj[1000]} \\ \ldots \\ u_{subj[2863]} \end{bmatrix} + \begin{bmatrix} ccloze_1 \\ ccloze_2 \\ \ldots \\ ccloze_{101} \\ ccloze_{102} \\ \ldots \\ ccloze_{215} \\ ccloze_{216} \\ ccloze_{217} \\ \ldots \\ ccloze_{1000} \\ \ldots \\ ccloze_{2863} \end{bmatrix} \circ \begin{bmatrix} \beta \\ \beta \\ \ldots \\ \beta \\ \beta \\ \ldots \\ \beta \\ \beta \\ \beta \\ \ldots \\ \beta \\ \ldots \\ \beta \end{bmatrix} \\ & = \begin{bmatrix} \alpha \\ \alpha \\ \ldots \\ \alpha \\ \alpha \\ \ldots \\ \alpha \\ \alpha \\ \alpha \\ \ldots \\ \alpha \\ \ldots \\ \alpha \end{bmatrix} + \begin{bmatrix} u_{1} \\ u_{1} \\ \ldots \\ u_{2} \\ u_{2} \\ \ldots \\ u_{3} \\ u_{3} \\ u_{3} \\ \ldots \\ u_{13} \\ \ldots \\ u_{37 } \end{bmatrix} + \begin{bmatrix} {-0.476} \\ {-0.446} \\ \ldots \\ {-0.206} \\ {0.494} \\ \ldots \\ {-0.136} \\ {0.094} \\ {0.294} \\ \ldots \\ {0.524} \\ \ldots \\ {0.494 } \end{bmatrix} \circ \begin{bmatrix} \beta \\ \beta \\ \ldots \\ \beta \\ \beta \\ \ldots \\ \beta \\ \beta \\ \beta \\ \ldots \\ \beta \\ \ldots \\ \beta \end{bmatrix} \end{aligned} \tag{11.1} \end{equation}\]

In this model, each subject has their own intercept adjustment \(u_i\), with \(i\) indexing the subjects. If \(u_i\) is positive, the subject has a more positive EEG signal than the average over all the subjects; if \(u_i\) is negative, then the subject has a more negative EEG signal than the average; and if \(u_i\) is \(0\), then the subject has the same EEG signal as the average. As we discussed in section 5.2.3, since we are estimating \(\alpha\) and \(u\) at the same time and we assume that the average of the \(u\)’s is \(0\) (since it is assumed to be normally distributed with a mean of zero and there is an intercept to absorb any non-zero mean), whatever the subjects have in common “goes” to \(\alpha\), and \(u\) only “absorbs” the differences between subjects through the variance component \(\tau_u\).

This model is implemented in the file hierarchical1.stan, available in the bcogsci package:

data {
  int<lower=1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  // The following line defines an array of integers;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  real<lower = 0>  tau_u;
  real alpha;
  real beta;
  vector[N_subj] u;
}
model {
  target += normal_lpdf(alpha| 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50)  -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u | 0, 20)  -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(u | 0, tau_u);
  target += normal_lpdf(signal | alpha + u[subj] +
                        c_cloze * beta, sigma);
}

In the Stan code above, we use

array [N] int<lower = 1, upper = N_subj> subj;

to define a one-dimensional array of N elements that contains integers (bounded between 1 and N_subj). As explained in Box 10.4, the difference between vectors and one-dimensional arrays is that vectors can only contain real numbers and can be used with matrix algebra functions, and arrays can contain any type but can’t be used with matrix algebra functions. We use normal_lpdf rather than normal_glm_lpdf since at the moment there is no efficient likelihood implementation of hierarchical generalized linear models.

The following code centers the predictor cloze and stores the data required by the Stan model in a list. Because we are using subj as a vector of indices, we need to be careful to have integers starting from 1 and ending in N_subj without skipping any number (but the order of the subject ids won’t matter).44

data("df_eeg")
df_eeg <- df_eeg %>%
  mutate(c_cloze = cloze - mean(cloze))
ls_eeg <- list(N = nrow(df_eeg),
               signal = df_eeg$n400,
               c_cloze = df_eeg$c_cloze,
               subj = df_eeg$subj,
               N_subj = max(df_eeg$subj))

Fit the model:

hierarchical1 <- system.file("stan_models",
                             "hierarchical1.stan",
                             package = "bcogsci")
fit_eeg1 <- stan(hierarchical1, data = ls_eeg)

Summary of the model:

print(fit_eeg1, pars = c("alpha", "beta", "sigma", "tau_u"))
##        mean  2.5% 97.5% n_eff Rhat
## alpha  3.64  2.75  4.47  1730    1
## beta   2.32  1.25  3.34  5402    1
## sigma 11.64 11.33 11.95  5332    1
## tau_u  2.17  1.54  2.99  2801    1

11.1.2 Uncorrelated varying intercept and slopes model with Stan

In the following model, we relax the strong assumption that every subject will be affected equally by the manipulation. For ease of exposition, we start by assuming that (as we did in section 5.2.3) the adjustments for the intercept and slope are not correlated.

\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta+ u_{subj[n],2}),\sigma) \tag{11.2} \end{equation}\]

\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(0,10)\\ \beta &\sim \mathit{Normal}(0,10)\\ u_1 &\sim \mathit{Normal}(0,\tau_{u_1})\\ u_2 &\sim \mathit{Normal}(0,\tau_{u_2})\\ \tau_{u_1} &\sim \mathit{Normal}_+(0,20) \\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20) \\ \sigma &\sim \mathit{Normal}_+(0,50) \end{aligned} \tag{11.3} \end{equation}\]

We implement this in Stan in hierarchical2.stan:

data {
  int<lower=1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2]  tau_u;
  real alpha;
  real beta;
  matrix[N_subj, 2] u;
}
model {
  target += normal_lpdf(alpha| 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50)  -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u[1] | 0, 20)  -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(tau_u[2] | 0, 20)  -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(u[, 1]| 0, tau_u[1]);
  target += normal_lpdf(u[, 2]| 0, tau_u[2]);
  target += normal_lpdf(signal | alpha + u[subj, 1] +
                        c_cloze .* (beta + u[subj, 2]), sigma);
}

In the previous model, we assign the same prior distribution to both tau_u[1] and tau_u[2], and thus in principle we could have written the two statements in one (we multiply by 2 because there are two log-PDFs that need to be corrected for the truncation):

  target += normal_lpdf(tau_u | 0, 20)  - 
    2 * normal_lccdf(0 | 0, 20);

Fit the model as follows:

hierarchical2 <- system.file("stan_models",
                             "hierarchical2.stan",
                             package = "bcogsci")
fit_eeg2 <- stan(hierarchical2, data = ls_eeg)
 ## Warning: Bulk Effective Samples Size (ESS) is too low,
 ## indicating posterior means and medians may be unreliable.
 ## Running the chains for more iterations may help. See
 ## https://mc-stan.org/misc/warnings.html#bulk-ess 
 ## Warning: Tail Effective Samples Size (ESS) is too low,
 ## indicating posterior variances and tail quantiles may be
 ## unreliable. Running the chains for more iterations may
 ## help. See
 ## https://mc-stan.org/misc/warnings.html#tail-ess 

We see that there are warnings. As we increase the complexity and the number of parameters, the sampler has a harder time exploring the parameter space.

Show the summary of the model below:

print(fit_eeg2, pars = c("alpha", "beta", "tau_u", "sigma"))
##           mean  2.5% 97.5% n_eff Rhat
## alpha     3.65  2.81  4.42  1793 1.00
## beta      2.34  1.14  3.59  3755 1.00
## tau_u[1]  2.18  1.53  2.98  2351 1.00
## tau_u[2]  1.89  0.52  3.55   250 1.02
## sigma    11.61 11.31 11.93  6461 1.00

We see that tau_u[2] has a low number of effective samples (n_eff).

The traceplots are displayed in Figure 11.1:

traceplot(fit_eeg2, pars = c("alpha", "beta", "tau_u", "sigma"))
Traceplots of alpha, beta, tau_u, and sigma from the model fit_eeg2.

FIGURE 11.1: Traceplots of alpha, beta, tau_u, and sigma from the model fit_eeg2.

Figure 11.1 shows that the chains of the parameter tau_u[2] are not mixing properly. This parameter is especially problematic because there are not enough data from each subject to estimate this parameter accurately, its estimated mean is quite small (in comparison with sigma), it’s bounded by zero, and there is a dependency between this parameter and u[, 2]. This makes the exploration by the sampler quite hard.

Pairs plots can be useful to uncover pathologies in the sampling, since we can visualize dependencies between samples, which are in general problematic.45 The following code creates a pair plot where we see the samples of tau_u[2] against some of the adjustments to the slope u; see Figure 11.2.

pairs(fit_eeg2, pars = c("tau_u[2]", "u[1,2]", "u[2,2]", "u[3,2]"))
Pair plots showing a relatively strong dependency (funnel-shaped clouds of samples) between the samples of \(\tau_2\) and some of the by-subject adjustments to the slope.

FIGURE 11.2: Pair plots showing a relatively strong dependency (funnel-shaped clouds of samples) between the samples of \(\tau_2\) and some of the by-subject adjustments to the slope.

Compare with tau_u[1] plotted against the by-subject adjustments to the intercept. In Figure 11.3, instead of funnels we see blobs, indicating no strong dependency between the parameters.

pairs(fit_eeg2, pars = c("tau_u[1]", "u[1,1]", "u[2,1]", "u[3,1]"))
Pair plots showing no strong dependency (blob-shaped clouds of samples) between the samples of \(\tau_1\) and some of the by-subject adjustments to the intercept.

FIGURE 11.3: Pair plots showing no strong dependency (blob-shaped clouds of samples) between the samples of \(\tau_1\) and some of the by-subject adjustments to the intercept.

In contrast to Figure 11.3, Figure 11.2 shows a relatively strong dependency between the samples of some of the parameters of the model, in particular \(\tau_2\) and the samples of \(u_{i,2}\): If we see small values in the samples for \(\tau_2\), then the values for the samples of \(u_i\) are very close to zero (and they are larger if values of samples of \(\tau_2\) are larger). This strong dependency is hindering the exploration of the sampler leading to the warnings we saw in Stan. However, the problem that the sampler faces is, in fact, more serious than what our initial plots show. Stan samples in an unconstrained space where all the parameters can range from minus infinity to plus infinity, and then transforms back the parameters to the constrained space that we specified where, for example, a standard deviation parameter is restricted to be positive. This means that Stan is actually sampling from a transformed parameter equivalent to log(tau_u[2]) rather than from tau_u[2]. We can use mcmc_pairs to see the actual funnel; see Figure 11.4.

mcmc_pairs(as.array(fit_eeg2),
           pars = c("tau_u[2]", "u[1,2]"),
           transform = list(`tau_u[2]` = "log")) 
Pair plots showing a strong dependency (funnel-shaped clouds of samples) between the samples of \(\tau_2\) and one of the by-subject adjustments to the intercept (\(u_{1,2}\)).

FIGURE 11.4: Pair plots showing a strong dependency (funnel-shaped clouds of samples) between the samples of \(\tau_2\) and one of the by-subject adjustments to the intercept (\(u_{1,2}\)).

At the neck of the funnel, tau_u[2] is close to zero (and log(tau_u[2]) is a negative number) and thus the adjustment u is constrained to be near \(0\). This is a problem because a step size that’s optimized to work well in the broad part of the funnel will fail to work appropriately in the neck of the funnel and vice versa; see also Neal’s funnel (Neal 2003) and the optimization chapter of Stan Development Team (2024) (section 26.7 of the User’s guide). There are two options: We might just remove the by-subject varying slope since it’s not giving us much information anyway, or we can alleviate this problem by reparameterizing the model. In general, this sort of thing is the trickiest and probably most annoying part of modeling. A model can be theoretically and mathematically sound, but still fail to converge. The best advice to solve this type of problem is to start small with simulated data where we know the true values of the parameters, and increase the complexity of the models gradually. Although in this example the problem was clearly in the parameterization of tau_u[2], in many cases the biggest hurdle is to identify where the problem lies. Fortunately, the issue with tau_u[2] is a common problem which is easy to solve by using a non-centered parameterization (Betancourt and Girolami 2015; Papaspiliopoulos, Roberts, and Sköld 2007). The Box 11.1 explains the specific reparameterization we use for the improved version of our Stan code.

Box 11.1 A simple non-centered parameterization

The sampler can explore the parameter space more easily if its step size is appropriate for all the parameters. This is achieved when there are no strong dependencies between parameters. We want to assume the following.

\[\begin{equation} \mathbf{u}_{2} \sim \mathit{Normal}(0, \tau_{u_2}) \end{equation}\]

where \(\mathbf{u}_{2}\) is the column vector of \(u_{i,2}\)’s. The index \(i\) refers to the subject id.

We can transform a vector \(v\) into \(z\)-scores as follows.

\[\begin{equation} \mathbf{z}_{v} =\frac{\mathbf{v} - mean(\mathbf{v})}{SD(\mathbf{v})} \end{equation}\]

Transforming the parameter \(u_2\) into \(z\)-scores amounts to centering it, so we can call this a centered parameterization.

\[\begin{equation} \mathbf{z}_{u_2} =\frac{\mathbf{u}_{2} - 0}{\tau_{u_2}} \end{equation}\]

where \[\begin{equation} \mathbf{z}_{u_2} \sim \mathit{Normal}(0, 1) \end{equation}\]

Now \(\mathbf{z}_{u_2}\) is easier to sample because it doesn’t depend on other parameters (in particular, it is no longer conditional on \(\tau\)) and its scale is \(1\). Once we have sampled this centered parameter, we can derive the actual parameter we care about by carrying out the inverse operation, which is called a non-centered parameterization:

\[\begin{equation} \mathbf{u}_{2} = \mathbf{z}_{u_2} \cdot \tau_{u_2} \end{equation}\]

A question that might be raised here is whether using a non-centered parameterization is always a good idea. Betancourt and Girolami (2015) point out that the extremeness of the dependency depends on the amount of data, and the efficacy of the parameterization depends on the strength of the data (on how informative the data is). When there is enough data, this parameterization is unnecessary and it may be more efficient to use the centered parameterization. However, cases where there is enough data to render this parameterization useless might also be cases where the partial pooling of the hierarchical models may not be needed in the first place. Although data from conventional lab experiments in psychology, psycholinguistics, and related areas seem to benefit from the non-centered parameterization, the jury is still out for larger data sets with thousands of subjects from crowdsourcing websites.

From a mathematical point of view, the following model is equivalent to the one described in Equations (11.2) and (11.3). However, as discussed previously, the computational implementation of the “new” model is more efficient. The following model includes the reparameterization of both adjustments \(u_1\) and \(u_2\), although the reparameterization of \(u_1\) is not strictly necessary (we didn’t see any problems in the traceplots), it won’t hurt either and the Stan code will be simpler.

\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta+ u_{subj[n],2}),\sigma) \tag{11.4} \end{equation}\]

\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(0,10)\\ \beta &\sim \mathit{Normal}(0,10)\\ z_{u_1} &\sim \mathit{Normal}(0, 1)\\ z_{u_2} &\sim \mathit{Normal}(0, 1)\\ \tau_{u_1} &\sim \mathit{Normal}_+(0,20) \\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20) \\ \sigma &\sim \mathit{Normal}_+(0,50)\\ u_1 &= z_{u_1} \cdot \tau_{u_1}\\ u_2 &= z_{u_2} \cdot \tau_{u_2} \end{aligned} \tag{11.5} \end{equation}\]

The following Stan code (hierarchical3.stan) uses the previous parameterization, and introduces some new Stan functions: to_vector() converts a matrix into a long column vector (in column-major order, that is, concatenating the columns from left to right); and std_normal_lpdf() implements the log PDF of a standard normal distribution, a normal distribution with location 0 and scale 1. This function is just a more efficient version of normal_lpdf(... | 0, 1). We also introduce a new optional block called transformed parameters. With each iteration of the sampler, the values of the parameters (i.e., alpha, beta, sigma, and z) are available at the transformed parameters block, and we can derive new auxiliary variables based on them. In this case, we use z_u and tau_u to obtain u, that then becomes available in the model block. Notice that both the model block and R (when we output the stan object) can access both parameters and transformed parameters.

data {
  int<lower=1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2]  tau_u;
  real alpha;
  real beta;
  matrix[N_subj, 2] z_u;
}
transformed parameters {
  matrix[N_subj, 2] u;
  u[, 1] = z_u[, 1] * tau_u[1];
  u[, 2] = z_u[, 2] * tau_u[2];
 }
model {
  target += normal_lpdf(alpha| 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50)  -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u[1] | 0, 20)  -
    normal_lccdf(0 | 0, 20);
 target += normal_lpdf(tau_u[2] | 0, 20)  -
    normal_lccdf(0 | 0, 20);
  target += std_normal_lpdf(to_vector(z_u));
  target += normal_lpdf(signal | alpha + u[subj, 1] +
                        c_cloze .* (beta + u[subj, 2]), sigma);
}

By reparameterizing the model we can also optimize it more, we can convert the matrix z_u into a long column vector allowing us to use a single call of std_normal_lpdf. Fit the model named hierarchical3.stan.

hierarchical3 <- system.file("stan_models",
                             "hierarchical3.stan",
                             package = "bcogsci")
fit_eeg3 <- stan(hierarchical3, data = ls_eeg) 

Verify that the model worked as expected by printing its summary and traceplots; see Figure 11.5.

print(fit_eeg3, pars = c("alpha", "beta", "tau_u", "sigma"))
##           mean  2.5% 97.5% n_eff Rhat
## alpha     3.64  2.79  4.44  1285 1.00
## beta      2.35  1.13  3.60  3578 1.00
## tau_u[1]  2.16  1.53  2.95  1379 1.00
## tau_u[2]  1.75  0.16  3.52   946 1.01
## sigma    11.62 11.32 11.95  4847 1.00
traceplot(fit_eeg3, pars = c("alpha", "beta", "tau_u", "sigma"))
Traceplots of alpha, beta, tau_u, and sigma from the model fit_eeg3.

FIGURE 11.5: Traceplots of alpha, beta, tau_u, and sigma from the model fit_eeg3.

Although the samples of tau_u[2] are still conditioned by the adjustments for the slope, u[,2], these latter parameters are not the ones explored by the model, the auxiliary parameters, z_u, are the relevant ones for the sampler. The plots in Figures 11.6 and 11.7 show that although samples from log(tau_u[2]) and u[1,2] depend on each other, samples from log(tau_u[2]) and z_u[1,2] do not.


mcmc_pairs(as.array(fit_eeg3),
           pars = c("tau_u[2]", "u[1,2]"),
           transform = list(`tau_u[2]` = "log"))
Pair plots showing a clear dependency (funnel-shaped clouds of samples) between the samples of \(\tau_2\) and some of the by-subject adjustments to the slope.

FIGURE 11.6: Pair plots showing a clear dependency (funnel-shaped clouds of samples) between the samples of \(\tau_2\) and some of the by-subject adjustments to the slope.


mcmc_pairs(as.array(fit_eeg3),
           pars = c("tau_u[2]", "z_u[1,2]"),
           transform = list(`tau_u[2]` = "log"))
Pair plots showing no clear dependency between the samples of \(\tau_2\) and some of the by-subject auxiliary parameters (z_u) used to build the adjustments to the slope.

FIGURE 11.7: Pair plots showing no clear dependency between the samples of \(\tau_2\) and some of the by-subject auxiliary parameters (z_u) used to build the adjustments to the slope.

11.1.3 Correlated varying intercept varying slopes model

For the model with correlated varying intercepts and slopes, the likelihood remains identical to the model without a correlation between group-level intercepts and slopes. Priors and hyperpriors change to reflect the potential correlation between by-subject adjustments to intercepts and slopes:

\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta + u_{subj[n],2}),\sigma) \end{equation}\]

The correlation is indicated in the priors on the adjustments for vector of by-subject intercepts \(u_{1}\) and the vector of by-subject slopes \(u_{2}\).

  • Priors: \[\begin{equation} \begin{aligned} \alpha & \sim \mathit{Normal}(0,10) \\ \beta & \sim \mathit{Normal}(0,10) \\ \sigma &\sim \mathit{Normal}_+(0,50)\\ {\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_u} \right) \end{aligned} \end{equation}\]

where \(i\) ranges from \(1\) to \(N_{subj}\)

\[\begin{equation} \boldsymbol{\Sigma_u} = {\begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix}} \end{equation}\]

\[\begin{equation} \begin{aligned} \tau_{u_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20)\\ \mathbf{R_u} & = \begin{bmatrix} 1 & \rho_u \\ \rho_u & 1 \end{bmatrix} \sim \mathit{LKJcorr}(2) \end{aligned} \end{equation}\]

As a first attempt, we write this model following the mathematical notation as closely as possible. We’ll see that this will be problematic in terms of efficient sampling and convergence. In this Stan model (hierarchical_corr.stan), we use some new functions and types:

  • corr_matrix[n] R; defines a (square) matrix of \(n\) rows and \(n\) columns called \(R\), symmetrical around a diagonal of ones.
  • rep_row_vector(X, n) creates a row vector with \(n\) columns filled with \(X\).
  • quad_form_diag(R, v) creates a quadratic form using the column vector \(v\) as a diagonal matrix (a matrix with all zeros except for its diagonal), this function corresponds in Stan to: diag_matrix(v) * R * diag_matrix(v) and in R to diag(v) %*% R %*% diag(v). This computes a variance-covariance matrix from the vector of standard deviations, \(v\), and the correlation matrix, \(R\) (recall the generation of multivariate data in section 1.6.3).
data {
  int<lower=1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2]  tau_u;
  real alpha;
  real beta;
  matrix[N_subj, 2] u;
  corr_matrix[2] R_u;
}
model {
  target += normal_lpdf(alpha| 0,10);
  target += normal_lpdf(beta | 0,10);
  target += normal_lpdf(sigma | 0, 50)  -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u | 0, 20)  -
    2 * normal_lccdf(0 | 0, 20);
  target += lkj_corr_lpdf(R_u | 2);
  for(i in 1:N_subj)
    target +=  multi_normal_lpdf(u[i,] |
                                 rep_row_vector(0, 2),
                                 quad_form_diag(R_u, tau_u));
  target += normal_lpdf(signal | alpha + u[subj, 1] +
                        c_cloze .* (beta + u[subj, 2]), sigma);
}

Problematic aspects of the first model presented in section 11.1.2 (before the reparameterization), that is, dependencies between parameters, are also present here. Fit the model as follows:

hierarchical_corr <- system.file("stan_models",
                                 "hierarchical_corr.stan",
                                 package = "bcogsci") 
fit_eeg_corr <- stan(hierarchical_corr, data = ls_eeg)
 ## Warning: There were 1 divergent transitions after
 ## warmup. See
 ## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
 ## to find out why this is a problem and how to eliminate
 ## them. 
 ## Warning: There were 3 chains where the estimated
 ## Bayesian Fraction of Missing Information was low. See
 ## https://mc-stan.org/misc/warnings.html#bfmi-low 
 ## Warning: Examine the pairs() plot to diagnose sampling
 ## problems 
 ## Warning: The largest R-hat is NA, indicating chains have
 ## not mixed. Running the chains for more iterations may
 ## help. See https://mc-stan.org/misc/warnings.html#r-hat 
 ## Warning: Bulk Effective Samples Size (ESS) is too low,
 ## indicating posterior means and medians may be unreliable.
 ## Running the chains for more iterations may help. See
 ## https://mc-stan.org/misc/warnings.html#bulk-ess 
 ## Warning: Tail Effective Samples Size (ESS) is too low,
 ## indicating posterior variances and tail quantiles may be
 ## unreliable. Running the chains for more iterations may
 ## help. See
 ## https://mc-stan.org/misc/warnings.html#tail-ess 

As we expected, there are warnings and bad mixing of the chains for tau_u[2]; see also the traceplots in Figure 11.8.

print(fit_eeg_corr, pars = c("alpha", "beta", "tau_u", "sigma"))
##           mean  2.5% 97.5% n_eff Rhat
## alpha     3.63  2.83  4.46  1362  1.0
## beta      2.33  1.11  3.57  4343  1.0
## tau_u[1]  2.19  1.53  2.99  2168  1.0
## tau_u[2]  1.56  0.17  3.43    39  1.1
## sigma    11.62 11.31 11.93  5458  1.0

traceplot(fit_eeg_corr, pars = c("alpha", "beta", "tau_u", "sigma"))
Traceplots of alpha, beta, tau_u, and sigma from the model fit_eeg_corr.

FIGURE 11.8: Traceplots of alpha, beta, tau_u, and sigma from the model fit_eeg_corr.

The problem (which can also be discovered in a pairs plot) is the same one that we saw before: There is a strong dependency between tau_u[2] (in fact, log(tau_u[2]), which is the parameter dimension that the sampler considers) and u, creating a funnel.

The solution to this problem is the reparameterization of this model. The reparameterization for this type of model requires us to use Cholesky factorization (Fieller 2016). The mathematics and the intuition behind this parameterization are explained in Box 11.2.

Box 11.2 Cholesky factorization

First, some definitions that we will need below. A matrix is square if the number of rows and columns is identical. A square matrix \(A\) is symmetric if \(A^T = A\), i.e., if transposing the matrix gives the matrix back. Suppose that \(A\) is a known matrix with real numbers. If \(\boldsymbol{x}\) is a vector of variables with length \(p\) (a \(p\times 1\) matrix), then \(x^T A x\) is called a quadratic form in \(x\) (\(x^T A x\) will be a scalar, \(1\times 1\)). If \(x^TAx>0\) for all \(x\), then \(A\) is a positive definite matrix. If \(x^TAx\geq 0\) for all \(x\neq0\), then \(A\) is positive semi-definite.

We encountered correlation matrices first in section 1.6.3. A correlation matrix is always symmetric, has ones along the diagonal, and real values ranging between \(-1\) and \(1\) on the off-diagonals. Given a correlation matrix \(\mathbf{R_u}\), we can decompose it into a lower triangular matrix \(\mathbf{L_u}\) such that \(\mathbf{L_u}\mathbf{L_u}^T=\mathbf{R_u}\). The matrix \(\mathbf{L_u}\) is called the Cholesky factor of \(\mathbf{R_u}\). Intuitively, you can think of \(\mathbf{L_u}\) as the matrix equivalent of the square root of \(\mathbf{R_u}\). More details on the Cholesky factorization can be found in Gentle (2007).

\[\begin{equation} \mathbf{L_u} = {\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}} \end{equation}\]

For a model without a correlation between adjustments for the intercept and slope, we assumed that adjustments \(u_{1}\) and \(u_{2}\) were generated by two independent normal distributions. But now we want to allow the possibility that the adjustments can have a non-zero correlation. We can use the Cholesky factorization to generate correlated random variables in the following way.

  1. Generate uncorrelated vectors, \(z_{u_1}\) and \(z_{u_2}\), for each vector of adjustments \(u_1\) and \(u_2\), as sampled from \(\mathit{Normal}(0,1)\):

\[z_{u_1} \sim \mathit{Normal}(0,1)\] \[z_{u_2} \sim \mathit{Normal}(0,1)\]

  1. By multiplying the Cholesky factor with the \(z\)’s, generate a matrix that contains two row vectors of correlated variables (with standard deviation \(1\)).

\[ \mathbf{L_u}\cdot \mathbf{z_u} = {\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}} {\begin{pmatrix} z_{u_{1,subj=1}} & z_{u_{1,subj=2}} & ... & z_{u_{1,subj=N_{subj}}} \\ z_{u_{2,subj=1}} & z_{u_{2,subj=2}} & ... & z_{u_{2,subj=N_{subj}}} \end{pmatrix}} \]

\[ = {\begin{pmatrix} L_{11} \cdot z_{u_{1,1}} + 0 \cdot z_{u_{2,1}} & ... & L_{11} \cdot z_{u_{1,N_{subj}}} + 0 \cdot z_{u_{2,1}} \\ L_{21} \cdot z_{u_{1,1}} + L_{22} \cdot z_{u_{2,1}} & ... & L_{11} \cdot z_{u_{1,N_{subj}}} + L_{22} \cdot z_{u_{2,N_{subj}}} \end{pmatrix}} \]

A very informal explanation of why this works is that we are making the variable that corresponds to the slope to be a function of a scaled version of the intercept.

  1. The last step is to scale the previous matrix to the desired standard deviation. We define the diagonalized matrix \(diag\_matrix(\tau_u)\) as before:

\[ {\begin{pmatrix} \tau_{u_1} & 0 \\ 0 & \tau_{u_2} \end{pmatrix}} \]

Then pre-multiply it by the correlated variables with standard deviation 1 from before:

\[\mathbf{u} = diag\_matrix(\tau_u) \cdot \mathbf{L_u}\cdot \mathbf{z_u} = \]

\[ {\begin{pmatrix} \tau_{u_1} & 0 \\ 0 & \tau_{u_2} \end{pmatrix}} {\begin{pmatrix} L_{11} \cdot z_{u_{1,1}} & ... \\ L_{21} \cdot z_{u_{1,1}} + L_{22} \cdot z_{u_{2,1}} & ... \end{pmatrix}} \]

\[ {\begin{pmatrix} \tau_{u_1} \cdot L_{11} \cdot z_{u_{1,1}} & \tau_{u_1} \cdot L_{11} \cdot z_{u_{1,2}} & ... \\ \tau_{u_2} \cdot (L_{21} \cdot z_{u_{1,1}} + L_{22} \cdot z_{u_{2,1}}) & \tau_{u_2} \cdot (L_{21} \cdot z_{u_{1,2}} + L_{22} \cdot z_{u_{2,2}}) & ... \end{pmatrix}} \]

It might be helpful to see how one would implement this in R:

Let’s assume a correlation of \(0.8\).

rho_u <- 0.8
# Correlation matrix
(R_u <- matrix(c(1, rho_u, rho_u, 1), ncol = 2))
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0
# Cholesky factor:
# (Transpose it so that it looks the same as in Stan)
(L_u <- t(chol(R_u)))
##      [,1] [,2]
## [1,]  1.0  0.0
## [2,]  0.8  0.6
# Verify that we recover R_u,
# Recall that %*% indicates matrix multiplication
L_u %*% t(L_u)
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0
  1. Generate uncorrelated z from a standard normal distribution assuming only 10 subjects.
N_subj <- 10
(z_u1 <- rnorm(N_subj, 0, 1))
##  [1]  1.0966  1.4457 -1.9251  0.4128  1.5934 -0.4140 -0.2122
##  [8] -0.0365  0.3650  0.6652
(z_u2 <- rnorm(N_subj, 0, 1))
##  [1]  1.31782 -0.09549  0.19628  2.48800  0.43110  0.18875
##  [7] -1.34224  0.00286 -0.22133 -0.01105
  1. Create matrix of correlated parameters.

First, create a matrix with the uncorrelated parameters:

# matrix z_u
(z_u <- matrix(c(z_u1, z_u2), ncol = N_subj, byrow = TRUE))
##      [,1]    [,2]   [,3]  [,4]  [,5]   [,6]   [,7]     [,8]
## [1,] 1.10  1.4457 -1.925 0.413 1.593 -0.414 -0.212 -0.03654
## [2,] 1.32 -0.0955  0.196 2.488 0.431  0.189 -1.342  0.00286
##        [,9]  [,10]
## [1,]  0.365  0.665
## [2,] -0.221 -0.011

Then, generate correlated parameters by pre-multiplying the \(\mathbf{z}_u\) matrix with \(\mathbf{L}_u\).

L_u %*% z_u
##      [,1] [,2]  [,3]  [,4] [,5]   [,6]   [,7]    [,8]  [,9] [,10]
## [1,] 1.10 1.45 -1.93 0.413 1.59 -0.414 -0.212 -0.0365 0.365 0.665
## [2,] 1.67 1.10 -1.42 1.823 1.53 -0.218 -0.975 -0.0275 0.159 0.526
  1. Use the following diagonal matrix to scale the z_u.
tau_u1 <- 0.2
tau_u2 <- 0.01
(diag_matrix_tau <- diag(c(tau_u1,  tau_u2)))
##      [,1] [,2]
## [1,]  0.2 0.00
## [2,]  0.0 0.01
  1. Finally, generate the adjustments for each subject u:
(u <- diag_matrix_tau %*% L_u %*% z_u)
##        [,1]  [,2]    [,3]   [,4]   [,5]     [,6]     [,7]
## [1,] 0.2193 0.289 -0.3850 0.0826 0.3187 -0.08280 -0.04243
## [2,] 0.0167 0.011 -0.0142 0.0182 0.0153 -0.00218 -0.00975
##           [,8]    [,9]   [,10]
## [1,] -0.007307 0.07300 0.13303
## [2,] -0.000275 0.00159 0.00526
# The rows are correlated, approximately 0.8
cor(u[1, ], u[2, ])
## [1] 0.848
# The variance components can be recovered as well:
sd(u[1, ])
## [1] 0.207
sd(u[2, ])
## [1] 0.0112

The reparameterization of the model, which allows for a correlation between adjustments for the intercepts and slopes, in hierarchical_corr2.stan is shown below. The code implements the following new types and functions:

  • cholesky_factor_corr[2] L_u, which defines L_u as a lower triangular (\(2 \times 2\)) matrix which has to be the Cholesky factor of a correlation.
  • diag_pre_multiply(tau_u,L_u) which makes a diagonal matrix out of the vector tau_u and multiplies it by L_u.
  • to_vector(z_u) makes a long vector out the matrix z_u.
  • lkj_corr_cholesky_lpdf(L_u | 2) is the Cholesky factor associated with the LKJ correlation distribution. It implies that lkj_corr_lpdf(L_u * L_u'| 2). The symbol ' indicates transposition (in R, this is the function t()).
data {
  int<lower=1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2]  tau_u;
  real alpha;
  real beta;
  matrix[2, N_subj] z_u;
  cholesky_factor_corr[2] L_u;
}
transformed parameters {
  matrix[N_subj, 2] u;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
}
model {
  target += normal_lpdf(alpha| 0,10);
  target += normal_lpdf(beta | 0,10);
  target += normal_lpdf(sigma | 0, 50)  -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u | 0, 20)  -
    2 * normal_lccdf(0 | 0, 20);
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += std_normal_lpdf(to_vector(z_u));
  target += normal_lpdf(signal | alpha + u[subj, 1] +
                        c_cloze .* (beta + u[subj, 2]), sigma);
}
generated quantities {
  matrix[2, 2] R_u = (L_u * L_u');
  real rho_u = R_u[1, 2];
  vector[N_subj] effect_by_subj = beta + u[ , 2];
}

In this Stan model, we also defined an effect_by_subject in the generated quantities. This would allow us to plot or to summarize by-subject effects of cloze probability.

One can recover the correlation parameter by adding in the generated quantities section code to extract one element of the \(2 \times 2\) correlation matrix. The correlation matrix is recovered with L_u * L_u', and then [1, 2] extracts the element in the first row and second column (recall that the diagonal is filled with ones).

Fit the new model:

hierarchical_corr2 <- system.file("stan_models",
                                  "hierarchical_corr2.stan",
                                  package = "bcogsci")
fit_eeg_corr2 <- stan(hierarchical_corr2, data = ls_eeg)

The Cholesky matrix has some elements which are always zero or one, and thus the variance within and between chains (and therefore Rhat) are not defined. However, the rest of the parameters of the model have an appropriate number of effective sample size (more than 10% of the total number of post-warmup samples), Rhats are close to one, and the chains are mixing well; see also the traceplots in Figure 11.9.

print(fit_eeg_corr2,
      pars =
        c("alpha", "beta", "tau_u", "rho_u", "sigma", "L_u")) 
##           mean  2.5% 97.5% n_eff Rhat
## alpha     3.64  2.82  4.45  1432    1
## beta      2.31  1.06  3.53  3868    1
## tau_u[1]  2.20  1.56  3.02  1498    1
## tau_u[2]  1.61  0.13  3.48   990    1
## rho_u     0.17 -0.56  0.77  3721    1
## sigma    11.61 11.33 11.91  5204    1
## L_u[1,1]  1.00  1.00  1.00   NaN  NaN
## L_u[1,2]  0.00  0.00  0.00   NaN  NaN
## L_u[2,1]  0.17 -0.56  0.77  3721    1
## L_u[2,2]  0.92  0.61  1.00  2136    1

traceplot(fit_eeg_corr2,
          pars = c("alpha", "beta", "tau_u", "L_u[2,1]", "L_u[2,2]", "sigma")) 
`## Warning: Line is too long.`
Traceplots of alpha, beta, tau_u, L_u, and sigma from the model fit_eeg_corr.

FIGURE 11.9: Traceplots of alpha, beta, tau_u, L_u, and sigma from the model fit_eeg_corr.

Is there a correlation between the by-subject intercept and slope?

Let’s visualize some of the posteriors with the following code (see Figure 11.10):

mcmc_hist(as.data.frame(fit_eeg_corr2),
          pars = c("beta", "rho_u")) 
Histograms of the samples of the posteriors of beta and rho_u from the model fit_eeg_corr2.

FIGURE 11.10: Histograms of the samples of the posteriors of beta and rho_u from the model fit_eeg_corr2.

Figure 11.10 shows that the posterior distribution is widely spread out between \(-1\) and \(+1\). One can’t really learn from these data whether the by-subject intercepts and slopes are correlated. The broad spread of the posterior indicates that we don’t have enough data to estimate this parameter with high enough precision: the posterior is basically just reflecting the prior specification (the LKJcorr prior with parameter \(\eta = 2\)).

11.1.4 By-subject and by-items correlated varying intercept varying slopes model

We extend the previous model by adding by-items intercepts and slopes, and priors and hyperpriors that reflect the potential correlation between by-items adjustments to intercepts and slopes:

\[\begin{multline} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n], 1} + w_{item[n], 1} + \\ c\_cloze_n \cdot (\beta + u_{subj[n],2} + w_{item[n], 2}),\sigma) \end{multline}\]

The correlation is indicated in the priors on the adjustments for the vectors representing the varying intercepts \(u_{1}\) and varying slopes \(u_{2}\) for subjects, and the varying intercepts \(w_{1}\) and varying slopes \(w_{2}\) for items.

  • Priors: \[\begin{equation} \begin{aligned} \alpha & \sim \mathit{Normal}(0,10) \\ \beta & \sim \mathit{Normal}(0,10) \\ \sigma &\sim \mathit{Normal}_+(0, 50)\\ {\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_u} \right) \\ {\begin{pmatrix} w_{i,1} \\ w_{i,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_w} \right) \end{aligned} \end{equation}\]

\[\begin{equation} \boldsymbol{\Sigma_u} = {\begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_1} & \tau_{u_2}^2 \end{pmatrix}} \end{equation}\]

\[\begin{equation} \boldsymbol{\Sigma_w} = {\begin{pmatrix} \tau_{w_1}^2 & \rho_w \tau_{w_1} \tau_{w_2} \\ \rho_w \tau_{w_1} \tau_{w_1} & \tau_{w_2}^2 \end{pmatrix}} \end{equation}\]

\[\begin{equation} \begin{aligned} \tau_{u_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20)\\ \mathbf{R_u} & = \begin{bmatrix} 1 & \rho_u \\ \rho_u & 1 \end{bmatrix} \sim \mathit{LKJcorr}(2) \end{aligned} \end{equation}\]

\[\begin{equation} \begin{aligned} \tau_{w_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{w_2} &\sim \mathit{Normal}_+(0,20)\\ \mathbf{R_w} & = \begin{bmatrix} 1 & \rho_w \\ \rho_w & 1 \end{bmatrix} \sim \mathit{LKJcorr}(2) \end{aligned} \end{equation}\]

The translation to Stan (hierarchical_corr_by.stan) looks as follows:

data {
  int<lower=1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  int<lower = 1> N_item;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
  array[N] int<lower = 1, upper = N_item> item;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2]  tau_u;
  vector<lower = 0>[2]  tau_w;
  real alpha;
  real beta;
  matrix[2, N_subj] z_u;
  matrix[2, N_item] z_w;
  cholesky_factor_corr[2] L_u;
  cholesky_factor_corr[2] L_w;
}
transformed parameters {
  matrix[N_subj, 2] u;
  matrix[N_item, 2] w;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
  w = (diag_pre_multiply(tau_w, L_w) * z_w)';
}
model {
  target += normal_lpdf(alpha| 0,10);
  target += normal_lpdf(beta | 0,10);
  target += normal_lpdf(sigma | 0, 50)  -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u | 0, 20) -
    2 * normal_lccdf(0 | 0, 20);
  target += normal_lpdf(tau_w | 0, 20) -
    2* normal_lccdf(0 | 0, 20);
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += lkj_corr_cholesky_lpdf(L_w | 2);
  target += std_normal_lpdf(to_vector(z_u));
  target += std_normal_lpdf(to_vector(z_w));
  target += normal_lpdf(signal | alpha + u[subj, 1] + w[item, 1] +
                        c_cloze .* (beta + u[subj, 2] + w[item, 2]),
                        sigma);
}
generated quantities {
  real rho_u = (L_u * L_u')[1, 2];
  real rho_w = (L_w * L_w')[1, 2];
}

Add item as a number to the data and store it in a list:

df_eeg <- df_eeg %>%
  mutate(item = as.numeric(as.factor(item)))
ls_eeg <- list(N = nrow(df_eeg),
               signal = df_eeg$n400,
               c_cloze = df_eeg$c_cloze,
               subj = df_eeg$subj,
               item = df_eeg$item,
               N_subj = max(df_eeg$subj),
               N_item = max(df_eeg$item))

Fit the model:

hierarchical_corr_by <- system.file("stan_models",
                                    "hierarchical_corr_by.stan",
                                    package = "bcogsci")
fit_eeg_corr_by <- stan(hierarchical_corr_by, data = ls_eeg)

Print the summary:

print(fit_eeg_corr_by,
      pars = c("alpha", "beta", "sigma", "tau_u", "tau_w", 
               "rho_u", "rho_w")) 
##           mean  2.5% 97.5% n_eff Rhat
## alpha     3.65  2.72  4.57  1851    1
## beta      2.32  0.94  3.63  3076    1
## sigma    11.48 11.19 11.80  4485    1
## tau_u[1]  2.22  1.56  3.06  1510    1
## tau_u[2]  1.50  0.09  3.38  1229    1
## tau_w[1]  1.52  0.88  2.20  1261    1
## tau_w[2]  2.30  0.28  4.24   982    1
## rho_u     0.14 -0.61  0.77  3465    1
## rho_w    -0.40 -0.89  0.33  2131    1

The summary above shows that the data are far too sparse to get tight estimates of the correlation parameters rho_u and rho_w. Both posteriors are widely spread out.

This completes our review of hierarchical models and their implementation in Stan. The importance of coding a hierarchical model directly in Stan rather than using brms is that this increases the flexibility of the type of models that we can fit. In fact, we will see in chapters 17-20 that the same “machinery” can be used to have hierarchical parameters in cognitive models.

11.2 Summary

In this chapter, we learned to fit the four standard types of hierarchical models that we encountered in earlier chapters:

  • The by-subjects varying intercepts model.
  • The by-subjects varying intercepts and varying slopes model without any correlation.
  • The by-subjects varying intercepts and varying slopes model with correlation.
  • The hierarchical model, with a full variance covariance matrix for both subjects and items.

We also learned about some important and powerful tools for making the Stan models more efficient at sampling: the non-centered parameterization and the Cholesky factorization. One important takeaway was that if data are sparse, the posteriors will just reflect the priors. We saw examples of this situation when investigating the posteriors of the correlation parameters.

11.3 Further reading

Gelman and Hill (2007) provides a comprehensive introduction to Bayesian hierarchical models, although that edition does not use Stan but rather WinBUGS. Sorensen, Hohenstein, and Vasishth (2016) is a short tutorial on hierarchical modeling using Stan, especially tailored for psychologists and linguists.

An additional example of reparameterization (kindly suggested by Martin Modrák) involves reparameterizing a varying intercept/effects model in terms of total variance and its distribution across individual components. This approach has both theoretical advantages, such as making it easier to elicit priors on total variance and its split, and practical benefits, such as allowing us to work with data sets where only a small number of subjects have repeated measurements, which might otherwise pose computational challenges. This concept is explored in more depth in Hem, Fuglstad, and Riebler (2022). Additionally, a short gist provided by Martin Modrák demonstrates how this reparameterization resolves divergences and increases the effective sample size in a model adapted from this chapter. At the time of writing, Modrák’s gist is available at https://gist.github.com/martinmodrak/61a68f22ed954ae68abc76faeef860b6.

11.4 Exercises

Exercise 11.1 A log-normal model in Stan.

Refit the Stroop example from section 5.3 in Stan (df_stroop).

Assume the following likelihood and priors:

\[\begin{equation} rt_n \sim \mathit{LogNormal}(\alpha + u_{subj[n],1} + c\_cond_n \cdot (\beta + u_{subj[n],2}), \sigma) \end{equation}\]

\[\begin{equation} \begin{aligned} \alpha & \sim \mathit{Normal}(6, 1.5) \\ \beta & \sim \mathit{Normal}(0, .1) \\ \sigma &\sim \mathit{Normal}_+(0, 1) \end{aligned} \end{equation}\]

\[\begin{equation} \begin{aligned} \tau_{u_1} &\sim \mathit{Normal}_+(0,1)\\ \tau_{u_2} &\sim \mathit{Normal}_+(0,1)\\ \begin{bmatrix} 1 & \rho_u \\ \rho_u & 1 \end{bmatrix} &\sim \mathit{LKJcorr}(2) \end{aligned} \end{equation}\]

Exercise 11.2 A by-subjects and by-items hierarchical model with a log-normal likelihood.

Revisit the question “Are subject relatives easier to process than object relatives?” Fit the model from the exercise 5.2 using Stan.

Exercise 11.3 A hierarchical logistic regression with Stan.

Revisit the question “Is there a Stroop effect in accuracy?” Fit the model the exercise 5.6 using Stan.

Exercise 11.4 A distributional regression model of the effect of cloze probability on the N400.

In section 5.2.6, we saw how to fit a distributional regression model. We might want to extend this approach to Stan. Fit the EEG data to a hierarchical model with by-subject and by-items varying intercept and slopes, and in addition assume that the residual standard deviation (the scale of the normal likelihood) can vary by subject.

\[\begin{equation} \begin{aligned} signal_n &\sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + \\ & c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma_n)\\ \sigma_n &= \exp(\alpha_{\sigma} + u_{\sigma_{subj[n]}}) \end{aligned} \end{equation}\]

\[\begin{equation} \begin{aligned} \alpha_\alpha &\sim \mathit{Normal}(0,log(50))\\ u_\sigma &\sim \mathit{Normal}(0, \tau_{u_\sigma}) \\ \tau_{u_\sigma} &\sim \mathit{Normal}_+(0, 5) \end{aligned} \end{equation}\]

To fit this model, take into account that sigma is now a vector, and it is a transformed parameter which depends on two parameters: alpha_sigma and the vector with N_subj elements u_sigma. In addition, u_sigma depends on the hyperparameter tau_u_sigma (\(\tau_{u_\sigma}\)). (Using the non-centered parameterization for u_sigma speeds up the model fit considerably).

References

Betancourt, Michael J., and Mark Girolami. 2015. “Hamiltonian Monte Carlo for Hierarchical Models.” Current Trends in Bayesian Methodology with Applications 79 (30): 2–4.

Fieller, Nick. 2016. Basics of Matrix Algebra for Statistics with R. Boca Raton, FL: CRC Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gentle, James E. 2007. “Matrix Algebra: Theory, Computations, and Applications in Statistics.” Springer Texts in Statistics 10.

Hem, Ingeborg Gullikstad, Geir-Arne Fuglstad, and Andrea Riebler. 2022. “makemyprior: Intuitive Construction of Joint Priors for Variance Parameters in R.” https://arxiv.org/abs/2105.09712.

Neal, Radford M. 2003. “Slice Sampling.” Annals of Statistics 31 (3): 705–67. https://doi.org/10.1214/aos/1056562461.

Papaspiliopoulos, Omiros, Gareth O. Roberts, and Martin Sköld. 2007. “A General Framework for the Parametrization of Hierarchical Models.” Statistical Science 22 (1): 59–73. https://doi.org/10.1214/088342307000000014.

Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.” Quantitative Methods for Psychology 12 (3): 175–200.

Stan Development Team. 2024. “Stan Modeling Language Users Guide and Reference Manual, Version 2.32.” https://mc-stan.org/docs/2_35/.


  1. Whereas in (11.1), \(\alpha\) and \(\beta\) are “broadcasted” into a 2863-vector for ease of exposition, Stan uses vector-scalar arithmetic.↩︎

  2. It often happens that the subject ids are not in sequence in a given data frame. This can happen when, for example, one has lost some subjects due to incomplete data or for some other reason. In such situations, we can re-number the subjects sequentially as follows: type as.numeric(as.factor(df_eeg$subj)) to transform a vector where some numbers are skipped into a vector with consecutive numbers. For example, both as.numeric(as.factor(c(1, 3, 4, 7, 9))) and as.numeric(as.factor(paste0("subj", c(1, 3, 4, 7, 9)))) will give as output 1 2 3 4 5.↩︎

  3. See https://mc-stan.org/misc/warnings.html.↩︎