A Regression models with brms
- Extended
A.1 An efficient function for generating prior predictive distributions in R
As noted in section 3.3, generating prior predictive distributions can be computationally slow if done naively in R with a for-loop: Producing \(1000\) samples of the prior predictive distribution for our model from section 3.3 results in \(361000\) predicted values, which takes a few seconds to compute. Although this approach works, it is not optimal for more complex models or larger datasets.
To address this, one could use a more efficient function using the map2_dfr()
function from the purrr
package, which yields approximately a 10-fold increase in speed. Although the distributions should be the same with both functions, the specific numbers in the tables won’t be, due to the randomness in the process of sampling.
The purrr
function map2_dfr()
(which works similarly to the base R function lapply()
and Map()
) essentially runs a for-loop, and builds a data frame with the output. It iterates over the values of two vectors (or lists) simultaneously, here, mu_samples
and sigma_samples
and, in each iteration, it applies a function to each value of the two vectors, here, mu
and sigma
. The output of each function is a data frame (or tibble in this case) with N_obs
observations which is bound in a larger data frame at the end of the loop. Each of these data frames bound together represents an iteration in the simulation, and we identify the iterations by setting .id = "iter"
.
Although this method for generating prior predictive distributions is a bit involved, it has an advantage in comparison to the more straightforward use of predict()
(or posterior_predict()
, which can also generate prior predictions) together with setting sample_prior = "only"
in the brms
model (as we will do in section 3.7.2). Our method of generating prior predictive distributions does not depend on Stan’s sampler, which means that no matter the number of iterations in our simulation or how uninformative our priors, there will never be any convergence problems.
library(purrr)
# Define the function:
function(mu_samples,
normal_predictive_distribution <-
sigma_samples,
N_obs) {map2_dfr(mu_samples, sigma_samples, function(mu, sigma) {
tibble(trialn = seq_len(N_obs),
t_pred = rnorm(N_obs, mu, sigma))
.id = "iter") %>%
}, # .id is always a string and
# needs to be converted to a number
mutate(iter = as.numeric(iter))
}# Test the timing:
tic()
prior_pred <- normal_predictive_distribution(mu_samples = mu_samples,
sigma_samples = sigma_samples,
N_obs = N_obs)
toc()
## 0.431 sec elapsed
A.2 Truncated distributions
Any distribution can be truncated. For a continuous distribution, the truncated version of the original distribution will have non-zero probability density values for a continuous subset of the original coverage. To make this more concrete, in the example of section 3.5, the normal distribution has coverage for values between minus infinity to plus infinity, and our truncated version \(\mathit{Normal}_+\) has coverage between zero and plus infinity: all negative values have a density of zero. Let’s see how we can generalize this to be able to understand any truncation of any continuous distribution. (For the discrete case, we can simply replace the integral with a sum, and replace PDF with PMF).
From the axiomatic definitions of probability, we know that the area below a PDF, \(f(x)\), must be equal to one (section 1.1). More formally, this means that the integral of \(f\) evaluated as \(f(-\infty <X < \infty)\) should be equal to one:
\[\begin{equation} \int_{-\infty}^{\infty} f(x) dx = 1 \end{equation}\]
But if the distribution is truncated, \(f\) is going to be evaluated in some subset of its possible values, \(f(a <X < b)\); in the specific case of \(\mathit{Normal}_+\), for example, \(a = 0\), and \(b=\infty\). In the general case, this means that the integral of the PDF evaluated for \(a <X < b\) will be lower than one, unless \(a=-\infty\) and \(b=+\infty\).
\[\begin{equation} \int_{a}^{b} f(x) dx < 1 \end{equation}\]
We want to ensure that we build a new PDF for the truncated distribution so that even though it has less coverage than the non-truncated version, it still integrates to one. To achieve this, we divide the “unnormalized” PDF by the total area of \(f(a <X < b)\) (recall the discussion surrounding Equation (1.1)):
\[\begin{equation} f_{[a,b]}(x) = \frac{f(x)}{\int_{a}^{b} f(x) dx} \end{equation}\]
The denominator of the previous equation is the difference between the CDF evaluated at \(X = b\) and the CDF evaluated at \(X =a\); this can be written as \(F(b) - F(a)\):
\[\begin{equation} f_{[a,b]}(x) = \frac{f(x)}{F(b) - F(a)} \tag{A.1} \end{equation}\]
For the specific case where \(f(x)\) is \(Normal(x | 0, \sigma)\) and we want the PDF of \(Normal_+(x | 0, \sigma)\), the bounds will be \(a= 0\) and \(b =\infty\).
\[\begin{equation} Normal_+(x |0, \sigma) = \frac{Normal(x | 0, \sigma)}{1/2} \end{equation}\]
Because \(F(X= b =\infty) = 1\) and \(F(X = a = 0) = 1/2\).
You can verify this in R (this is valid for any value of sd
).
dnorm(1, mean = 0) * 2 == dtnorm(1, mean = 0, a = 0)
## [1] TRUE
Unless the truncation of the normal distribution is symmetrical, the mean \(\mu\) of the truncated normal does not coincide with the mean of the parent (untruncated) normal distribution; call this mean of the parent distribution \(\hat{\mu}\). For any type of truncation, the standard deviation of the truncated distribution \(\sigma\) does not coincide with the standard deviation of the parent distribution; call this latter standard deviation \(\hat\sigma\). Confusingly enough, the family of truncated functions *tnorm()
keeps the names of the arguments of the family of functions *norm()
: mean
and sd
. So, when defining a truncated normal distribution like dtnorm(mean = 300, sd = 100, a = 0, b = Inf)
, the mean
and sd
refer to the mean \(\hat{\mu}\) and standard deviation \(\hat\sigma\) of the untruncated parent distribution.
Sometimes one needs to model observed data as coming from a truncated normal distribution. An example would be a vector of observed standard deviations; perhaps one wants to use these estimates to work out a truncated normal prior. In order to derive such an empirically motivated prior, we have to work out what mean and standard deviation we need to use in a truncated normal distribution. We could compute the mean and standard deviation from the observed vector of standard deviations, and then use the procedure shown below to work out the mean and standard deviation that we would need to put into the truncated normal distribution. This approach is used in online chapter E, section E.1.4 for working out a prior based on standard deviation estimates from existing data.
The mean and standard deviation of the parent distribution of a truncated normal (\(\hat\mu\) and \(\hat\sigma\)) with boundaries \(a\) and \(b\), given the mean \(\mu\) and standard deviation \(\sigma\) of the truncated normal, are computed as follows (Johnson, Kotz, and Balakrishnan 1995). \(\phi(X)\) is the PDF of the standard normal (i.e., \(\mathit{Normal}(\mu=0, \sigma=1)\)) evaluated at \(X\), and \(\Phi(X)\) is the CDF of the standard normal evaluated at \(X\).
First, define two terms \(\alpha\) and \(\beta\) for convenience:
\[\begin{align} \alpha =(a-\hat\mu )/\hat\sigma && \beta =(b-\hat\mu )/\hat\sigma \end{align}\]
Then, the mean \(\mu\) of the truncated distribution can be computed as follows based on the parameters of the parent distribution:
\[\begin{equation} \mu = \hat\mu - \hat\sigma {\frac {\phi (\beta )-\phi (\alpha )}{\Phi (\beta )-\Phi (\alpha )}} \tag{A.2} \end{equation}\]
The variance \(\sigma^2\) of the truncated distribution is:
\[\begin{equation} \sigma^2 = \hat\sigma^2 \times \left( 1 - \frac{\beta \phi (\alpha )-\alpha \phi (\beta )}{\Phi (\beta )-\Phi (\alpha )} - \left(\frac {\phi (\alpha )-\phi (\beta )}{\Phi (\beta )-\Phi (\alpha )}\right)^2 \right) \tag{A.3} \end{equation}\]
Equations (A.2) and (A.3) have two variables, so if one is given the values for the truncated distribution \(\mu\) and \(\sigma\), one can solve (using algebra) for the mean and standard deviation of the untruncated distribution, \(\hat\mu\) and \(\hat\sigma\).
For example, suppose that \(a=0\) and \(b=500\), and that the mean and standard deviation of the untruncated parent distribution is \(\hat\mu=300\) and \(\hat\sigma=200\). We can simulate such a situation and estimate the mean and standard deviation of the truncated distribution:
rtnorm(10000000, mean = 300, sd = 200, a = 0, b = 500)
x <-## the mean and sd of the truncated distributions
## using simulation:
mean(x)
## [1] 271
sd(x)
## [1] 129
These simulated values are identical to the values computed using equations (A.2) and (A.3):
0
a <- 500
b <- 300
bar_x <- 200
bar_sigma <- (a - bar_x) / bar_sigma
alpha <- (b - bar_x) / bar_sigma
beta <- ((dnorm(beta) - dnorm(alpha)) /
term1 <- (pnorm(beta) - pnorm(alpha)))
((beta * dnorm(beta) - alpha * dnorm(alpha)) /
term2 <- (pnorm(beta) - pnorm(alpha)))
## the mean and sd of the truncated distribution
## computed analytically:
bar_x - bar_sigma * term1) (mu <-
## [1] 271
sqrt(bar_sigma^2 * (1 - term2 - term1^2))) (sigma <-
## [1] 129
The equations for the mean and variance of the truncated distribution (\(\mu\) and \(\sigma\)) can also be used to work out the mean and variance of the parent untruncated distribution (\(\hat\mu\) and \(\hat\sigma\)), if one has estimates for \(\mu\) and \(\sigma\) (from data).
Suppose that we have observed data with mean \(\mu = 271\) and \(\sigma=129\). We want to assume that the data are coming from a truncated normal which has lower bound \(0\) and upper bound \(500\). What are the mean and standard deviation of the parent distribution, \(\hat\mu\) and \(\hat\sigma\)?
To answer this question, first rewrite the equations as follows:
\[\begin{equation} \mu - \hat\mu + \hat\sigma {\frac {\phi (\beta )-\phi (\alpha )}{\Phi (\beta )-\Phi (\alpha )}} = 0 \tag{A.4} \end{equation}\]
\[\begin{equation} \sigma^2 - \hat\sigma^2 \times \left( 1 - \frac{\beta \phi (\alpha )-\alpha \phi (\beta )}{\Phi (\beta )-\Phi (\alpha )} - \left(\frac {\phi (\alpha )-\phi (\beta )}{\Phi (\beta )-\Phi (\alpha )}\right)^2 \right) = 0 \tag{A.5} \end{equation}\]
Next, solve for \(\hat\mu\) and \(\hat\sigma\) given the observed mean and the standard deviation of the truncated distribution, and that one knows the boundaries (\(a\), and \(b\)).
Define the system of equations according to the specifications of multiroot()
from the package rootSolve
: x
for the unknowns (\(\hat\mu\) and \(\hat\sigma\)), and parms
for the known parameters: \(a\), \(b\), and the mean and standard deviation of the truncated normal.
function(x, parms) {
eq_system <- x[1]
mu_hat <- x[2]
sigma_hat <- (parms["a"] - mu_hat) / sigma_hat
alpha <- (parms["b"] - mu_hat) / sigma_hat
beta <-c(F1 = parms["mu"] - mu_hat + sigma_hat *
(dnorm(beta) - dnorm(alpha)) / (pnorm(beta) - pnorm(alpha)),
F2 = parms["sigma"] - sigma_hat *
sqrt((1 - ((beta) * dnorm(beta) - (alpha) * dnorm(alpha)) /
(pnorm(beta) - pnorm(alpha)) - ((dnorm(beta) - dnorm(alpha)) /
(pnorm(beta) - pnorm(alpha)))^2)))
}
Solving the two equations using multiroot()
from the package rootSolve
gives us the mean and standard deviation \(\hat\mu\) and \(\hat\sigma\) of the parent normal distribution. (Notice that x
is a required parameter of the previous function so that it works with multiroot()
, however, outside of the function the variable x
is a vector containing the samples of the truncated normal distribution generated with rtnorm()
).
multiroot(f = eq_system,
soln <-start = c(1, 1),
parms = c(a = 0,
b = 500,
mu = mean(x),
sigma = sd(x)))
$root soln
## [1] 300 200
The function compute_meansd_parent()
encapsulates the previous procedure and it is provided in the bcogsci
package.
A.3 Intercepts in brms
When we set up a prior for the intercept in brms
, we actually set a prior for an intercept assuming that all the predictors are centered. This means that when predictors are not centered (and only then), there is a mismatch between the interpretation of the intercept as returned in the output of brms
and the interpretation of the intercept with respect to its prior specification. In this case, only the intercept in the output corresponds to the formula in the brms
call, that is, the intercept in the output corresponds to the non-centered model. However, as we show below, when the intercept is much larger than the effects that we are considering in the formula (what we generally call \(\beta\)), this discrepancy hardly matters.
The reason for this mismatch when our predictors are uncentered is that brms
increases sampling efficiency by automatically centering all the predictors internally (that is the population-level design matrix \(X\) is internally centered around its column means when brms
fits a model). This did not matter in our previous examples because we centered our predictor (or we had no predictor), but it might matter if we want to have uncentered predictors. In the design we are discussing, a non-centered predictor of load will mean that the intercept, \(\alpha\), has a straightforward interpretation: the \(\alpha\) is the mean pupil size when there is no attention load. This is in contrast with the centered version presented before, where the intercept \(\alpha\) represents the pupil size for the average load of 2.44
(c_load
is equal to 0). The difference between the non-centered model (below) and the centered version presented before is depicted in Figure A.1.
Suppose that we are quite sure that the prior values for the no load condition (i.e., load is non-centered) fall between \(400\) and \(1200\) ms. In that case, the following prior could be set for \(\alpha\): \(\mathit{Normal}(800,200)\). In this case, the model becomes:
prior_nc <- c(prior(normal(800, 200), class = b, coef = Intercept),
prior(normal(0, 1000), class = sigma),
prior(normal(0, 100), class = b, coef = load))
brm(p_size ~ 0 + Intercept + load,
fit_pupil_non_centered <-data = df_pupil,
family = gaussian(),
prior = prior_nc)
When the predictor is non-centered as shown above, the regular centered intercept is removed by adding 0
to the formula, and by replacing the intercept with the “actual” intercept we want to set priors on with Intercept
. The word Intercept
is a reserved word; we cannot name any predictor with this name. This new parameter is also of class b
, so its prior needs to be defined accordingly. Once we use 0 + Intercept + ..
, the intercept is not calculated with predictors that are automatically centered any more.
The output below shows that, as expected, although the posterior for the intercept has changed noticeably, the posterior for the effect of load remains virtually unchanged.
posterior_summary(fit_pupil_non_centered,
variable = c("b_Intercept", "b_load"))
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 624.0 34.4 558.54 693.7
## b_load 32.3 11.6 8.58 54.6
Notice the following potential pitfall. A model like the one below will fit a non-centered load predictor, but will assign a prior of \(\mathit{Normal}(800,200)\) to the intercept of a model that assumes a centered predictor, \(\alpha_{centered}\), and not the current intercept, \(\alpha\).
prior_nc <- c(prior(normal(800, 200), class = Intercept),
prior(normal(0, 1000), class = sigma),
prior(normal(0, 100), class = b, coef = load))
brm(p_size ~ 1 + load,
fit_pupil_wrong <-data = df_pupil,
family = gaussian(),
prior = prior_nc)
What does it mean to set a prior to \(\alpha_{centered}\) in a model that doesn’t include \(\alpha_{centered}\)?
The fitted (expected) values of the non-centered model and the centered one are identical, that is, the values of the response distribution without the residual error are identical for both models:
\[\begin{equation} \alpha + load_n \cdot \beta = \alpha_{centered} + (load_n - mean(load)) \cdot \beta \tag{A.6} \end{equation}\]
The left side of Equation (A.6) refers to the expected values based on our current non-centered model, and the right side refers to the expected values based on the centered model. We can re-arrange terms to understand what the effect is of a prior on \(\alpha_{centered}\) in our model that doesn’t include \(\alpha_{centered}\).
\[\begin{equation} \begin{aligned} \alpha + load_n \cdot \beta &= \alpha_{centered} + load_n\cdot \beta - mean(load) \cdot \beta\\ \alpha &= \alpha_{centered} - mean(load) \cdot \beta\\ \alpha + mean(load) \cdot \beta &= \alpha_{centered} \end{aligned} \end{equation}\]
That means that in the non-centered model, we are actually setting our prior to \(\alpha + mean(load) \cdot \beta\). When \(\beta\) is very small (or the means of our predictors are very small because they might be “almost” centered), and the prior for \(\alpha\) is very wide, we might hardly notice the difference between setting a prior to \(\alpha_{centered}\) or to our actual \(\alpha\) in a non-centered model (especially if the likelihood dominates anyway). But it is important to pay attention to what the parameters represent that we are setting priors on.
To sum up, brms
automatically centers all predictors for posterior estimation, and the prior of the intercept is applied to the centered version of the model during model fitting. However, when the predictors specified in the formula are not centered, then brms
uses the equations shown before to return in the output the posterior of the intercept for the non-centered predictors.70
In our example analyses with brms
in this book, we will always center our predictors.
FIGURE A.1: Regression lines for the non-centered and centered linear regressions. The intercept (\(\alpha\)) represented by a circle is positioned differently depending on the centering, whereas the slope (\(\beta\)) represented by a vertical dashed line has the same magnitude in both models.
A.4 Understanding the log-normal likelihood
It is important to understand what we are assuming with a log-normal likelihood. Formally, if a random variable \(Z\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), then the transformed random variable \(Y = \exp(Z)\) is log-normally distributed and has density:
\[\begin{equation} \mathit{LogNormal}(y|\mu,\sigma)=f(z)= \frac{1}{\sqrt{2\pi \sigma^2}y} \exp \left(-\frac{(\log(y)-\mu)^2}{2\sigma^2} \right) \end{equation}\]
As explained in section 3.7.1, the model from Equation (4.1) is equivalent to the following:
\[\begin{equation} \log(t_n) \sim \mathit{Normal}(\alpha + c\_trial_n \cdot \beta,\sigma)\\ \tag{A.7} \end{equation}\]
The family of normal distributions is closed under linear transformations: that is, if \(X\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), then (for any real numbers \(a\) and \(b\)), \(a X + b\) is also normally distributed, with mean \(a \mu +b\) (and standard deviation \(\sqrt{a^2\sigma^2}=|a|\sigma\)).
This means that, assuming \(Z \sim \mathit{Normal}(\alpha, \sigma)\), Equation (A.7) can be re-written as follows:
\[\begin{equation} \log(rt_n) = Z + c\_trial_n \cdot \beta \tag{A.8} \end{equation}\]
Exponentiate both sides, and use the property of exponents that \(\exp(x+y)\) is equal to \(\exp(x) \cdot \exp(y)\); set \(Y =\exp(Z)\).
\[\begin{equation} \begin{aligned} rt_n &= \exp \big(Z + c\_trial_n \cdot \beta\big) \\ rt_n &= \exp(Z ) \cdot \exp\big(c\_trial_n \cdot \beta\big) \\ rt_n &= Y \cdot \exp\big(c\_trial_n \cdot \beta\big) \end{aligned} \end{equation}\]
The last equation has two terms being multiplied, the first one, \(Y\), is telling us that we are assuming that finger tapping times are log-normally distributed with a median of \(\exp(\alpha)\), the second term, \(\exp(c\_trial_n \cdot \beta)\) is telling us that the effect of trial number is multiplicative and grows or decays exponentially with the trial number. This has two important consequences:
Different values of the intercept, \(\alpha\), given the same \(\beta\), will affect the difference in finger tapping or response times for two adjacent trials (compare this with what happens with an additive model, such as when a normal likelihood is used); see Figure A.2. This is because, unlike in the additive case, the intercept doesn’t cancel out:
Additive case:
\[\begin{equation} \begin{aligned} & (\alpha + trial_n \cdot \beta) - (\alpha + trial_{n-1} \cdot \beta) = \\ &=\alpha -\alpha + ( trial_n - trial_{n-1} ) \cdot \beta\\ &= ( trial_n - trial_{n-1} ) \cdot \beta \end{aligned} \end{equation}\]
Multiplicative case:
\[\begin{equation} \begin{aligned} &\exp(\alpha) \cdot \exp(trial_n \cdot \beta) -\exp(\alpha) \cdot \exp(trial_{n-1} \cdot \beta) =\\ &= \exp(\alpha) \big(\exp(trial_n \cdot \beta) - \exp(trial_{n-1}\cdot \beta) \big)\\ &\neq \big(\exp(trial_n) - \exp(trial_{n-1}) \big) \cdot \exp(\beta) \end{aligned} \end{equation}\]
As the trial number increases, the same value of \(\beta\) will have a very different impact on the original scale of the dependent variable: Any (fixed) negative value for \(\beta\) will lead to exponential decay and any (fixed) positive value will lead to exponential growth; see Figure A.3.
FIGURE A.2: The fitted values of the difference in response time between two adjacent trials, when \(\beta=0.01\) and \(\alpha\) lies between 0.1 and 15. The graph shows how changes in the intercept lead to changes in the difference in response times between trials, even if \(\beta\) is fixed.
FIGURE A.3: The fitted values of the dependent variable (response times in ms) as a function of trial number, when (A) \(\beta = -0.01\), exponential decay, and when (B) \(\beta =0.01\), exponential growth.
Does exponential growth or decay make sense in this particular example? We need to consider that if they do make sense, they will be an approximation valid for a specific range of values, at some point we will expect a ceiling or a floor effect: response times cannot truly be 0 milliseconds, or take several minutes. However, in our specific model, exponential growth or decay by trial is probably a bad approximation: We will predict that our subject will take extremely long (if \(\beta >0\)) or extremely short (if \(\beta <0\)) time in pressing the space bar in a relatively low number of trials. This doesn’t mean that the likelihood is wrong by itself, but it does mean that at least we need to put a cap on the growth or decay of our experimental manipulation. We can do this if the exponential growth or decay is a function of, for example, log-transformed trial numbers:
\[\begin{equation} t_n \sim \mathit{LogNormal}(\alpha + c\_\log\_trial_n \cdot \beta,\sigma)\\ \end{equation}\]
FIGURE A.4: Fitted value of the dependent variable (times in ms) as function of the natural logarithm of the trial number, when (A) \(\beta=-0.01\), exponential decay, and when (B) \(\beta =.01\), exponential growth.
A.4.1 Log-normal distributions everywhere
The normal distribution is most often assumed to describe the random variation that occurs in the data from many scientific disciplines. However, most measurements actually show skewed distributions. Limpert, Stahel, and Abbt (2001) discuss the log-normal distribution in scientific disciplines and how diverse type of data, from lengths of latent periods of infectious diseases to distribution of mineral resources in the Earth’s crust, including even body height–the quintessential example of a normal distribution–closely fit the log-normal distribution.
Limpert, Stahel, and Abbt (2001) point out that because a random variable that results from multiplying many independent variables has an approximate log-normal distribution, the most basic indicator of the importance of the log-normal distribution may be very general: Chemistry and physics are fundamental in life, and the prevailing operation in the laws of these disciplines is multiplication rather than addition.
Furthermore, at many physiological and anatomical levels in the brain, the distribution of numerous parameters is in fact strongly skewed with a heavy tail, suggesting that skewed (typically log-normal) distributions are fundamental to structural and functional brain organization. This might be explained given that the majority of interactions in highly interconnected systems, especially in biological systems, are multiplicative and synergistic rather than additive (Buzsáki and Mizuseki 2014).
Does the log-normal distribution make sense for response times? It has been long noticed that the log-normal distribution often provides a good fit to response times distributions (Brée 1975; Ulrich and Miller 1994). One advantage of assuming log-normally distributed response times (but, in fact, this is true for many skewed distributions) is that it entails that the standard deviation of the response time distribution will increase with the mean, as has been observed in empirical distributions of response times (Wagenmakers, Grasman, and Molenaar 2005). Interestingly, it turns out that log-normal response times are also easily generated by certain process models. Ulrich and Miller (1993) show, for example, that models in which response times are determined by a series of processes cascading activation from an input level to an output level (usually passing through a number of intervening processing levels along the way) can generate log-normally distributed response times.
A.5 Prior predictive checks in R
The following function is an edited version of the earlier normal_predictive_distribution
from the online section A.1, which was used in section 3.3; it has been edited to make it compatible with logistic regression and dependent on set size.
As we did before, our custom function uses the purrr
function map2_dfr()
, which runs an efficient for-loop, iterating over two vectors (here alpha_samples
and beta_samples
), and builds a data frame with the output.
function(alpha_samples,
logistic_model_pred <-
beta_samples,
set_size,
N_obs) {map2_dfr(alpha_samples, beta_samples,
function(alpha, beta) {
tibble(set_size = set_size,
# center size:
c_set_size = set_size - mean(set_size),
# change the likelihood:
# Notice the use of a link function
# for alpha and beta
theta = plogis(alpha + c_set_size * beta),
correct_pred = rbern(N_obs, prob = theta))
},.id = "iter") %>%
# .id is always a string and needs
# to be converted to a number
mutate(iter = as.numeric(iter))
}
Let’s assume 800 observations with 200 observation for each set size:
800
N_obs <- rep(c(2, 4, 6, 8), 200) set_size <-
Now, iterate over plausible standard deviations of \(\beta\) with the purrr
function map_dfr()
, which iterates over one vector (here sds_beta
), and also builds a data frame with the output.
rnorm(1000, 0, 1.5)
alpha_samples <- c(1, 0.5, 0.1, 0.01, 0.001)
sds_beta <- map_dfr(sds_beta, function(sd) {
prior_pred <- rnorm(1000, 0, sd)
beta_samples <-logistic_model_pred(alpha_samples = alpha_samples,
beta_samples = beta_samples,
set_size = set_size,
N_obs = N_obs) %>%
mutate(prior_beta_sd = sd)
})
Calculate the accuracy for each one of the priors we want to examine, for each iteration, and for each set size.
mean_accuracy <- prior_pred %>%
group_by(prior_beta_sd, iter, set_size) %>%
summarize(accuracy = mean(correct_pred)) %>%
mutate(prior = paste0("Normal(0, ", prior_beta_sd, ")"))
The plot of the accuracy is shown in Figure 4.13 of the book, and repeated here in Figure A.5.
%>%
mean_accuracy ggplot(aes(accuracy)) +
geom_histogram() +
facet_grid(set_size ~ prior) +
scale_x_continuous(breaks = c(0, .5, 1))
FIGURE A.5: The prior predictive distributions of mean accuracy of the model defined in section 4.3, for different set sizes and different priors for \(\beta\).
It’s sometimes more useful to look at the predicted differences in accuracy between set sizes. We calculate them as follows, and plot them in Figure 4.14 of the book, repeated here in Figure A.6.
mean_accuracy %>%
diff_accuracy <- arrange(set_size) %>%
group_by(iter, prior_beta_sd) %>%
mutate(diff_accuracy = accuracy - lag(accuracy)) %>%
mutate(diffsize = paste(set_size, "-", lag(set_size))) %>%
filter(set_size > 2)
%>%
diff_accuracy ggplot(aes(diff_accuracy)) +
geom_histogram() +
facet_grid(diffsize ~ prior) +
scale_x_continuous(breaks = c(-.5, 0, .5))
FIGURE A.6: The prior predictive distributions of differences in mean accuracy between set sizes of the model defined in section 4.3 for different priors for \(\beta\).
A.6 Finitely exchangeable random variables
Formally, we say that the random variables \(Y_1,\dots,Y_N\) are finitely exchangeable if, for any set of particular outcomes of an experiment \(y_1,\dots,y_N\), the probability \(p(y_1,\dots,y_N)\) that we assign to these outcomes is unaffected by permuting the labels given to the variables. In other words, for any permutation \(\pi(n)\), where \(n=1,\dots,N\) (\(\pi\) is a function that takes as input the positive integer \(n\) and returns another positive integer; e.g., the function takes a subject indexed as 1, and returns index 3), we can reasonably assume that \(p(y_1,\dots,y_N)=p(y_{\pi(1)},\dots,y_{\pi(N)})\). A simple example is a coin tossed twice. Suppose the first coin toss is \(Y_1=1\), a heads, and the second coin toss is \(Y_2=0\), a tails. If we are willing to assume that the probability of getting one heads is unaffected by whether it appears in the first or the second toss, i.e., \(p(Y_1=1,Y_2=0)=p(Y_1=0,Y_2=1)\), then we assume that the indices are exchangeable.
Some important connections and differences between exchangeability and the frequentist concept of independent and identically distributed (iid):
If the data are exchangeable, they are not necessarily iid. For example, suppose you have a box with one black ball and two red balls in it. Your task is to repeatedly draw (without replacement) a ball at random. Suppose that in your first draw, you draw one ball and get the black ball. The probability of getting a black ball in the next two draws is now \(0\). However, if in your first draw you had retrieved a red ball, then there is a non-zero probability of drawing a black ball in the next two draws. The outcome in the first draw affects the probability of subsequent draws–they are not independent. But the sequence of random variables is exchangeable. To see this, consider the following: If a red ball is drawn, count it as a \(0\), and if a black ball is drawn, then count it as \(1\). Then, the three possible outcomes and the probabilities are
- \(1,0,0\); \(P(X_1=1,X_2=0,X_3=0) = \frac{1}{3} \times 1 \times 1=\frac{1}{3}\)
- \(0,1,0\) \(P(X_1=0,X_2=1,X_3=0) = \frac{2}{3} \times \frac{1}{2} \times 1=\frac{1}{3}\)
- \(0,0,1\) \(P(X_1=0,X_2=0,X_3=1) = \frac{2}{3} \times \frac{1}{2} \times 1=\frac{1}{3}\)
The random variables \(X_1,X_2,X_3\) can be permuted and the joint probability distribution (technically, the PMF) is the same in each case.
If the data are exchangeable, then they are identically distributed. For example, in the box containing one black ball and two red balls, suppose we count the draw of a black ball as a \(1\), and the draw of a red ball as a \(0\). Then the probability \(P(X_1=1)=\frac{1}{3}\) and \(P(X_1=0)=\frac{2}{3}\); this is also true for \(X_2\) and \(X_3\). That is, these random variables are identically distributed.
If the data are iid in the standard frequentist sense, then they are exchangeable. For example, suppose you have \(i=1,\dots,n\) instances of a random variable \(X\) whose PDF is \(f(x)\). Suppose also that \(X_i\) are iid. The joint PDF (this can be discrete or continuous, i.e., a PMF or PDF) is
\[\begin{equation} f_{X_1,\dots,X_n}(x_1,\dots,x_n) = f(x_1) \cdot \dots \cdot f(x_n) \end{equation}\]
Because the terms on the right-hand side can be permuted, the labels can be permuted on any of the \(x_i\). This means that \(X_1,\dots,X_n\) are exchangeable.
A.7 The Matrix Formulation of Hierarchical Models (the Laird-Ware form)
In the book, we generally write linear models as follows; where \(n\) refers to the row id in the data frame.
\[\begin{equation} y_n \sim \mathit{Normal}(\alpha + \beta\cdot x_n) \end{equation}\]
This simple linear model can be re-written as follows:
\[\begin{equation} y_n = \alpha + \beta\cdot x_n + \varepsilon_n \end{equation}\]
where \(\varepsilon_n \sim \mathit{Normal}(0, \sigma)\).
The model does not change if \(\alpha\) is multiplied by \(1\):
\[\begin{equation} y_n = \alpha\cdot 1 + \beta\cdot x_n + \varepsilon_n \end{equation}\]
The above is actually \(n\) linear equations, and can be written compactly in matrix form:
\[\begin{equation} {\begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n\\ \end{pmatrix}} = {\begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n\\ \end{pmatrix}} {\begin{pmatrix} \alpha\\ \beta \\ \end{pmatrix}} + {\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \\ \end{pmatrix}} \end{equation}\]
Consider this matrix in the above equation:
\[\begin{equation} {\begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n\\ \end{pmatrix}} \end{equation}\]
This matrix is called the model matrix or the design matrix; we will encounter it again in the contrast coding chapters, where it plays a crucial role. If we write the dependent variable \(y\) as a \(n\times 1\) vector, the above matrix as the matrix \(X\) (which has dimensions \(n \times 2\)), the intercept and slope parameters as a \(2\times 1\) matrix \(\zeta\), and the residual errors as an \(n\times 1\) matrix, we can write the linear model very compactly:
\[\begin{equation} y = X \zeta + \varepsilon \end{equation}\]
The above matrix formulation of the linear model extends to the hierarchical model very straightforwardly. For example, consider the by-subjects and by-items correlated varying intercept varying slopes model \(M_{sih}\) that we saw in section 5.2.5. This model has the following likelihood:
\[\begin{multline} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} \\ + c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma) \end{multline}\]
The terms in the location parameter in the normal likelihood can be re-written in matrix form, just like the linear model above. To see this, consider the fact that the location term
\[\begin{equation} \alpha + u_{subj[n],1} + w_{item[n],1} + c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}) \end{equation}\]
can be re-written as
\[\begin{multline} \alpha\cdot 1 + u_{subj[n],1}\cdot 1 + w_{item[n],1}\cdot 1 +\\ \beta \cdot c\_cloze_n + u_{subj[n],2}\cdot c\_cloze_n+ w_{item[n],2}\cdot c\_cloze_n \end{multline}\]
The above equation can in turn be written in matrix form as follows. The symbol \(\odot\) is the Hadamard product: this is cell-wise multiplication rather than matrix multiplication.71
\[\begin{equation} \begin{aligned} & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} {\begin{pmatrix} \alpha\\ \beta \\ \end{pmatrix}} + \\ & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} \odot {\begin{pmatrix} u_{subj[1],1} & u_{subj[1],2} \\ u_{subj[2],1} & u_{subj[2],2} \\ \vdots & \vdots\\ u_{subj[n],1} & u_{subj[n],2}\\ \end{pmatrix}} +\\ & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} \odot {\begin{pmatrix} w_{item[1],1} & w_{item[1],2}\\ w_{item[2],1} & w_{item[2],2} \\ \vdots & \vdots \\ w_{item[n],1} & w_{item[n],2} \\ \end{pmatrix}} \end{aligned} \end{equation}\]
In this hierarchical model, there are three model matrices:
- the model matrix associated with the intercept \(\alpha\) and the slope \(\beta\); below, we call this the matrix \(X\).
- the model matrix associated with the by-subject varying intercepts and slopes; call this the matrix \(Z_u\).
- the model matrix associated with the by-item varying intercepts and slopes; call this the matrix \(Z_w\).
The model can now be written very compactly in matrix form by writing these three matrices as follows:
\[\begin{equation} \begin{aligned} X = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}}\\ Z_u = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} \\ Z_w = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} \\ \end{aligned} \end{equation}\]
The location part of the model \(M_{sih}\) can now be written very compactly:
\[\begin{equation} X \zeta + Z_u \odot z_u + Z_w \odot z_w \end{equation}\]
Here, \(\zeta\) is a \(2\times 1\) matrix containing the intercept \(\alpha\) and the slope \(\beta\), and \(z_u\) and \(z_w\) are the intercept and slope adjustments by subject and by item:
\[\begin{equation} \begin{aligned} z_u = & {\begin{pmatrix} u_{subj[1],1} & u_{subj[1],2} \\ u_{subj[2],1} & u_{subj[2],2} \\ \vdots & \vdots\\ u_{subj[n],1} & u_{subj[n],2}\\ \end{pmatrix}}\\ z_w = & {\begin{pmatrix} w_{item[1],1} & w_{item[1],2}\\ w_{item[2],1} & w_{item[2],2} \\ \vdots & \vdots \\ w_{item[n],1} & w_{item[n],2} \\ \end{pmatrix}}\\ \end{aligned} \end{equation}\]
In summary, the hierarchical model has a very general matrix formulation (cf. Laird and Ware 1982):
\[\begin{equation} signal = X \zeta + Z_u \odot z_u + Z_w \odot z_w + \varepsilon \end{equation}\]
The practical relevance of this matrix formulation is that we can define hierarchical models very compactly and efficiently in Stan by expressing the model in terms of the model matrices (Sorensen, Hohenstein, and Vasishth 2016). As an aside, notice that in the above example, \(X=Z_u=Z_w\); but in principle one could have different model matrices for the fixed vs. random effects.
A.8 Treatment contrast with intercept as the grand mean
In chapter 6, we have introduced the treatment contrast, where each contrast compares one condition to a baseline condition. We have discussed that the intercept in the treatment contrast estimates the condition mean for the baseline condition. There are some applications where this behavior may seem sub-optimal. This can be the case in experimental designs with multiple factors, where we may want to use centered contrasts (this is discussed in chapter 7). Moreover, the contrast coding of the population-level (or fixed) effects also defines what the group-level (or random) effects assess. If the intercept assesses the grand mean–rather than the baseline condition–in hierarchical models, then the group-level intercepts reflect the grand mean variance, rather than the variance in the baseline condition.
It is possible to design a treatment contrast where the intercept reflects the grand mean (assuming a balanced design; otherwise it’s the unweighted grand mean). We implement this using the hypr
package. The trick is to add the intercept explicitly as a comparison of the average of all four condition means:
hypr(b0 = ~ (F1 + F2 + F3 + F4) / 4,
HcTrGM <-b1 = F2 ~ F1,
b2 = F3 ~ F1,
b3 = F4 ~ F1,
levels = c("F1", "F2", "F3", "F4"))
HcTrGM
## hypr object containing 4 null hypotheses:
## H0.b0: 0 = (F1 + F2 + F3 + F4)/4 (Intercept)
## H0.b1: 0 = F2 - F1
## H0.b2: 0 = F3 - F1
## H0.b3: 0 = F4 - F1
##
## Call:
## hypr(b0 = ~1/4 * F1 + 1/4 * F2 + 1/4 * F3 + 1/4 * F4, b1 = ~F2 -
## F1, b2 = ~F3 - F1, b3 = ~F4 - F1, levels = c("F1", "F2",
## "F3", "F4"))
##
## Hypothesis matrix (transposed):
## b0 b1 b2 b3
## F1 1/4 -1 -1 -1
## F2 1/4 1 0 0
## F3 1/4 0 1 0
## F4 1/4 0 0 1
##
## Contrast matrix:
## b0 b1 b2 b3
## F1 1 -1/4 -1/4 -1/4
## F2 1 3/4 -1/4 -1/4
## F3 1 -1/4 3/4 -1/4
## F4 1 -1/4 -1/4 3/4
The hypothesis matrix now explicitly codes the intercept as the first column, where all hypothesis weights are equal and sum up to one. This is coding the intercept. The other hypothesis weights are as expected for the treatment contrast. The contrast matrix now looks very different compared to the standard treatment contrast. Next, we fit a model with this adapted treatment contrast. The function contr.hypothesis
automatically removes the intercept that is encoded in HcTrGM
, since this is automatically added by brms
.
contrasts(df_contrasts3$F) <- contr.hypothesis(HcTrGM)
brm(DV ~ F,
fit_TrGM <-data = df_contrasts3,
family = gaussian(),
prior = c(prior(normal(20, 50), class = Intercept),
prior(normal(0, 50), class = sigma),
prior(normal(0, 50), class = b)))
fixef(fit_TrGM)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 20.01 2.38 15.07 24.7
## Fb1 9.48 6.75 -4.47 22.3
## Fb2 -0.42 6.75 -14.04 12.8
## Fb3 29.25 6.81 15.41 42.4
The results show that the coefficients reflect comparisons of each condition \(F2\), F3, and F4 to the baseline condition \(F1\). The intercept now captures the grand mean across all four conditions of \(20\).
References
Brée, David S. 1975. “The Distribution of Problem-Solving Times: An Examination of the Stages Model.” British Journal of Mathematical and Statistical Psychology 28 (2): 177–200. https://doi.org/10/cnx3q7.
Buzsáki, György, and Kenji Mizuseki. 2014. “The Log-Dynamic Brain: How Skewed Distributions Affect Network Operations.” Nature Reviews Neuroscience 15 (4): 264–78. https://doi.org/10.1038/nrn3687.
Johnson, Norman L., Samuel Kotz, and Narayanaswamy Balakrishnan. 1995. Continuous Univariate Distributions, Volume 2. Vol. 289. John Wiley; Sons.
Laird, Nan M., and James H. Ware. 1982. “Random-Effects Models for Longitudinal Data.” Biometrics, 963–74. https://doi.org/https://doi.org/10.2307/2529876.
Limpert, Eckhard, Werner A. Stahel, and Markus Abbt. 2001. “Log-Normal Distributions Across the Sciences: Keys and Clues.” BioScience 51 (5): 341. https://doi.org/10.1641/0006-3568(2001)051[0341:LNDATS]2.0.CO;2.
Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.” Quantitative Methods for Psychology 12 (3): 175–200.
Ulrich, Rolf, and Jeff Miller. 1993. “Information Processing Models Generating Lognormally Distributed Reaction Times.” Journal of Mathematical Psychology 37 (4): 513–25. https://doi.org/10.1006/jmps.1993.1032.
Ulrich, Rolf, and Jeff Miller. 1994. “Effects of Truncation on Reaction Time Analysis.” Journal of Experimental Psychology: General 123 (1): 34–80. https://doi.org/10/b8tsnh.
Wagenmakers, Eric-Jan, Raoul P. P. P. Grasman, and Peter C. M. Molenaar. 2005. “On the Relation Between the Mean and the Variance of a Diffusion Model Response Time Distribution.” Journal of Mathematical Psychology 49 (3): 195–204. https://doi.org/10.1016/j.jmp.2005.02.003.
These transformations are visible when checking the generated Stan code using
make_stancode()
.↩︎This means that if we have two matrices \(A_{i,j}\) and \(B_{i,j}\), the Hadamard product produces a matrix \(C_{i,j}\) that is the result of multiplying each cell \(A_{i,j}\) with \(B_{i,j}\), for all row and column ids \(i,j\) respectively.↩︎