Chapter 8 Contrast coding

Whenever one uses a categorical factor as a predictor in a Bayesian regression model, for example when estimating the difference in a dependent variable between two or three experimental conditions, it becomes necessary to code the discrete factor levels into numeric predictor variables. This coding is termed contrast coding. For example, in the previous chapter (section 5.3), we coded two experimental conditions as \(-1\) and \(+1\), i.e., implementing a sum contrast. Those contrasts are the values that we assign to predictor variables to encode specific comparisons between factor levels and to create predictor terms to estimate these comparisons in any type of regression, including Bayesian regressions.

Contrast coding in Bayesian models works more or less the same way as in frequentist models, and the same principles and tools can be used in both cases. As one important difference, the prior can matter in Bayesian models (especially when working with Bayes factors: Dablander et al. 2022; Rouder et al. 2012); it needs to be adapted to the scaling of contrasts. For example, given two conditions in an experiment, if we code a sum contrast (discussed below) with \(\pm 1\) coding, the parameter representing the difference between the two conditions will be half as large as the observed difference in the data, compared to the case when we code the contrasts with \(\pm 0.5\) coding. In the latter case, the parameter of interest does represent the observed difference in means between the conditions. The two contrast coding options (\(\pm 1\) and \(\pm 0.5\)) obviously imply different prior specifications for the parameter of interest.

This chapter will introduce contrast coding in the context of Bayesian models. The descriptions are in large parts taken from Schad et al. (2020) (which is published under a CC-BY 4.0 license: https://creativecommons.org/licenses/by/4.0/) and adapted for the current context.

Consider a situation where we want to estimate differences in a dependent variable between three factor levels. An example could be differences in response times between three levels of word class (noun, verb, adjective). We might be interested in whether word class influences response times. In frequentist statistics, one way to approach this question would be to run an ANOVA and compute an omnibus \(F\)-test for whether word class explains response times. A Bayesian equivalent to the frequentist omnibus \(F\)-test is Bayesian model comparison (i.e., Bayes factors), where we might compare an alternative model including word class as a predictor term with a null model lacking this predictor. We will discuss such Bayesian model comparison using Bayes factors in chapter 15. However, if based on such omnibus approaches we find support for an influence of word class on response times, it remains unclear where this effect actually comes from, i.e., whether it originated from the nouns, verbs, or adjectives. This is problematic for inference because scientists typically have specific expectations about which groups differ from each other. In this chapter, we will show how to estimate specific comparisons directly in a Bayesian linear model. This gives the researcher a lot of control over Bayesian analyses. Specifically, we show how planned comparisons between specific conditions (groups) or clusters of conditions, are implemented as contrasts. This is a very effective way to align expectations with the statistical model. In Bayesian models, any specific comparisons can also be computed after the model is fit. Nevertheless, coding a priori expectations into contrasts for model fitting will make it much more straightforward to estimate certain comparisons between experimental conditions, and will allow to perform Bayesian model comparisons using Bayes factors to provide evidence for or against very specific hypotheses.

For this and the next chapter, although knowledge of the matrix formulation of the linear model is not necessary, for a deeper understanding of contrast coding some exposure to the matrix formulation is desirable. We discuss the matrix formulation in Box 5.3 and in the frequentist textbook (Vasishth et al. 2021).

8.1 Basic concepts illustrated using a two-level factor

We first consider the simplest case: suppose we want to compare the means of a dependent variable (DV) such as response times between two groups of subjects. A simulated data set is available in the package bcogsci as the data set df_contrasts1. The simulations assumed longer response times in condition \(F1\) (\(\mu_1 = 0.8\) sec) than \(F2\) (\(\mu_2 = 0.4\) sec). The data from the \(10\) simulated subjects are aggregated and summary statistics are computed for the two groups.

data("df_contrasts1")
df_contrasts1
## # A tibble: 10 × 3
##   F        DV    id
##   <fct> <dbl> <int>
## 1 F1    0.636     1
## 2 F1    0.841     2
## 3 F1    0.555     3
## # ℹ 7 more rows
## [1] 0.6
TABLE 8.1: Summary statistics per condition for the simulated data.
Factor N data Est. means Std. dev. Std. errors
F1 \(5\) \(0.8\) \(0.2\) \(0.1\)
F2 \(5\) \(0.4\) \(0.2\) \(0.1\)
Means and standard errors of the simulated dependent variable (e.g., response times in seconds) in two conditions \(F1\) and \(F2\).

FIGURE 8.1: Means and standard errors of the simulated dependent variable (e.g., response times in seconds) in two conditions \(F1\) and \(F2\).

The results, displayed in Figure 8.1 and in Table 8.1, show that the assumed true condition means are exactly realized with the simulated data. The numbers are exact because the mvrnorm() function used here (see ?df_contrasts1) ensures that the data are generated so that the sample mean yields the true means for each level. In real data sets, of course, the sample means will vary from experiment to experiment.

A simple Bayesian linear model of DV on \(F\) yields a straightforward estimate of the difference between the group means. We use relatively uninformative priors. The estimates for the population-level effects are presented below using the function fixef():

fit_F <- brm(DV ~ F,
             data = df_contrasts1,
             family = gaussian(),
             prior = c(prior(normal(0, 2), class = Intercept),
                       prior(normal(0, 2), class = sigma),
                       prior(normal(0, 1), class = b)))
fixef(fit_F)
##           Estimate Est.Error  Q2.5 Q97.5
## Intercept     0.79      0.11  0.56  1.02
## FF2          -0.39      0.16 -0.72 -0.08

Comparing the means for each condition with the coefficients (Estimates) reveals that (i) the intercept (\(0.8\)) is the mean for condition \(F1\), \(\hat\mu_1\); and (ii) the slope (FF2: \(-0.4\)) is the difference between the estimated means for the two groups, \(\hat\mu_2 - \hat\mu_1\) (Bolker 2018):

\[\begin{equation} \begin{array}{lcl} \text{Intercept} = & \hat{\mu}_1 & = \text{estimated mean for } F1 \\ \text{Slope (FF2)} = & \hat{\mu}_2 - \hat{\mu}_1 & = \text{estim. mean for } F2 - \text{estim. mean for }F1 \end{array} \tag{8.1} \end{equation}\]

The new information is the \(95\)% credible interval for the difference between the two groups.

8.1.1 Default contrast coding: Treatment contrasts

How does the function brm() arrive at these particular values for the intercept and slope? That is, why does the intercept assess the mean of condition \(F1\) and how do we know the slope measures the difference in means between \(F2\)-\(F1\)? This result is a consequence of the default contrast coding of the factor \(F\). R assigns treatment contrasts to factors and orders their levels alphabetically. The alphabetically first factor level (here: \(F1\)) is coded in R by default as \(0\) and the second level (here: \(F2\)) is coded as \(1\).32 This becomes clear when we inspect the current contrast attribute of the factor using the contrasts() command:

contrasts(df_contrasts1$F)
##    F2
## F1  0
## F2  1

Why does this contrast coding yield these particular regression coefficients? Let’s take a look at the regression equation. Let \(\alpha\) represent the intercept, and \(\beta_1\) the slope. Then, the simple regression above expresses the belief that the expected response time \(\hat{y}\) (or \(E[Y]\)) is a linear function of the factor \(F\).

\[\begin{equation} E[Y] = \alpha + \beta_1 x \label{eq:lm1} \end{equation}\]

So, if \(x = 0\) (condition \(F1\)), the expectation is \(\alpha + \beta_1 \cdot 0 = \alpha\); and if \(x = 1\) (condition \(F2\)), the expectation is \(\alpha + \beta_1 \cdot 1 = \alpha + \beta_1\).

Expressing the above in terms of the estimated coefficients:

\[\begin{equation} \begin{array}{lccll} \text{estim. value for }F1 = & \hat{\mu}_1 = & \hat{\alpha} = & \text{Intercept} \\ \text{estim. value for }F2 = & \hat{\mu}_2 = & \hat{\alpha} + \hat{\beta}_1 = & \text{Intercept} + \text{Slope (FF2)} \end{array} \label{eq:predVal} \end{equation}\]

It is useful to think of such unstandardized regression coefficients as difference scores; they express the increase in the dependent variable \(y\) associated with a change in the independent variable \(x\) of \(1\) unit, such as going from \(0\) to \(1\) in this example. The difference between condition means is \(0.4 - 0.8 = -0.4\), which is the estimated regression coefficient \(\hat{\beta}_1\). The sign of the slope is negative because we have chosen to subtract the larger mean \(F1\) score from the smaller mean \(F2\) score.

8.1.2 Defining comparisons

The analysis of the regression equation demonstrates that in the treatment contrast the intercept assesses the average response in the baseline condition, whereas the slope estimates the difference between condition means. However, these are just verbal descriptions of what each coefficient assesses. Is it also possible to formally write down what each coefficient assesses?

From the perspective of parameter estimation, the slope represents the effect of main interest, so we consider this first. The treatment contrast specifies that the slope \(\beta_1\) estimates the difference in means between the two levels of the factor \(F\). This can formally be written as:

\[\begin{equation} \beta_1 = \mu_{F2} - \mu_{F1} \end{equation}\]

or equivalently:

\[\begin{equation} \beta_1 = - 1 \cdot \mu_{F1} + 1 \cdot \mu_{F2} \end{equation}\]

The \(\pm 1\) weights in the parameter estimation directly express which means are compared by the treatment contrast.

The intercept in the treatment contrast estimates a quantity that is usually of little interest: it estimates the mean in condition \(F1\). Formally, the parameter \(\alpha\) estimates the following quantity:

\[\begin{equation} \alpha = \mu_{F1} \end{equation}\]

or equivalently:

\[\begin{equation} \alpha = 1 \cdot \mu_{F1} + 0 \cdot \mu_{F2} . \end{equation}\]

The fact that the intercept term formally estimates the mean of condition \(F1\) is in line with our previous derivation (see equation (8.1)).

In R, factor levels are ordered alphabetically and by default the first level is used as the baseline in treatment contrasts. Obviously, this default mapping will depend on the levels’ alphabetical ordering. If a different baseline condition is desired, it is possible to re-order the levels. Here is one way of re-ordering the levels:

df_contrasts1$Fb <- factor(df_contrasts1$F, levels = c("F2", "F1"))
contrasts(df_contrasts1$Fb)
##    F1
## F2  0
## F1  1

This re-ordering did not change any data associated with the factor, only one of its attributes. With this new contrast attribute, a simple Bayesian model yields the following result.

fit_Fb <- brm(DV ~ Fb,
              data = df_contrasts1,
              family = gaussian(),
              prior = c(prior(normal(0, 2), class = Intercept),
                        prior(normal(0, 2), class = sigma),
                        prior(normal(0, 1), class = b)))
fixef(fit_Fb)
##           Estimate Est.Error Q2.5 Q97.5
## Intercept     0.40      0.11 0.18  0.62
## FbF1          0.39      0.16 0.08  0.71

The model now estimates different quantities. The intercept now codes the mean of condition \(F2\), and the slope measures the difference in means between \(F1\) minus \(F2\). This represents an alternative coding of the treatment contrast.

These model posteriors do not provide evidence for the hypothesis that the effect of factor \(F\) is different from zero. If the research focus is on such hypothesis testing, Bayesian hypothesis tests can be carried out using Bayes factors, by comparing a model containing a contrast of interest with a model lacking this contrast. We will discuss details of Bayesian hypothesis testing based on Bayes factors in chapter 15.

8.1.3 Sum contrasts

Treatment contrasts are only one of many options. It is also possible to use sum contrasts, which code one of the conditions as \(-1\) and the other as \(+1\), effectively centering the effects at the grand mean (GM, i.e., the mean of the two group means). Here, we rescale the contrast to values of \(-0.5\) and \(+0.5\), which makes the estimated treatment effect the same as for treatment coding and easier to interpret.

To define this contrast in a linear regression, one way is to use the contrasts function (another way is to define a column containing \(+0.5\) and \(-0.5\) for the corresponding levels of the factor).

contrasts(df_contrasts1$F) <- c(-0.5, +0.5)
fit_mSum <- brm(DV ~ F,
                data = df_contrasts1,
                family = gaussian(),
                prior = c(prior(normal(0, 2), class = Intercept),
                          prior(normal(0, 2), class = sigma),
                          prior(normal(0, 1), class = b)))
fixef(fit_mSum)
##           Estimate Est.Error  Q2.5 Q97.5
## Intercept     0.60      0.08  0.44  0.75
## F1           -0.39      0.16 -0.69 -0.08

Here, the slope (\(F1\)) again codes the difference of the groups associated with the first and second factor levels. It has the same value as in the treatment contrast. One important difference from the treatment contrast is that the intercept now represents the estimate of the average of condition means for \(F1\) and \(F2\), that is, the grand mean. For the scaled sum contrast:

\[\begin{equation} \begin{array}{lcl} \text{Intercept} = & (\hat{\mu}_1 + \hat{\mu}_2)/2 & = \text{estimated mean of }F1 \text{ and }F2 \\ \text{Slope (F1)} = & \hat{\mu}_2 - \hat{\mu}_1 & = \text{estim. mean of }F2 - \text{estim. mean for} F1 \end{array} \label{eq:beta2} \end{equation}\]

Why does the intercept assess the grand mean and why does the slope estimate the group difference? This is the result of rescaling the sum contrast. The first factor level (\(F1\)) was coded as \(-0.5\), and the second factor level (\(F2\)) as \(+0.5\):

contrasts(df_contrasts1$F)
##    [,1]
## F1 -0.5
## F2  0.5

Look again at the regression equation to better understand what computations are performed. Again, \(\alpha\) represents the intercept, \(\beta_1\) represents the slope, and the predictor variable \(x\) represents the factor \(F\). The regression equation is written as:

\[\begin{equation} E[Y] = \alpha + \beta_1 x \label{eq:lm2} \end{equation}\]

The group of \(F1\) subjects is then coded as \(-0.5\), and the response time for the group of \(F1\) subjects is \(\alpha + \beta_1 \cdot x_1 = 0.6 + (-0.4) \cdot (-0.5) = 0.8\). By contrast, the \(F2\) group is coded as \(+0.5\). By implication, the mean of the \(F2\) group must be \(\alpha + \beta_1 \cdot x_1 = 0.6 + (-0.4) \cdot 0.5 = 0.4\). Expressed in terms of the estimated coefficients:

\[\begin{equation} \begin{array}{lccll} \text{estim. value for }F1 = & \hat{\mu}_1 = & \hat{\alpha} - 0.5 \cdot \hat{\beta}_1 = & \text{Intercept} - 0.5 \cdot \text{Slope (F1)}\\ \text{estim. value for }F2 = & \hat{\mu}_2 = & \hat{\alpha} + 0.5 \cdot \hat{\beta}_1 = & \text{Intercept} + 0.5 \cdot \text{Slope (F1)} \end{array} \label{eq:predVal2} \end{equation}\]

The unstandardized regression coefficient is a difference score: Taking a step of one unit on the predictor variable \(x\), e.g., from \(-0.5\) to \(+0.5\), reflecting a step from condition \(F1\) to \(F2\), changes the dependent variable from \(0.8\) (for condition \(F1\)) to \(0.4\) (condition \(F2\)). This reflects a difference of \(0.4 - 0.8 = -0.4\); this is again the estimated regression coefficient \(\hat{\beta}_1\). Moreover, as mentioned above, the intercept now assesses the grand mean, i.e., the unweighted mean of the means for conditions \(F1\) and \(F2\): it is in the middle between condition means for \(F1\) and \(F2\).

So far we gave verbal statements about what is estimated by the intercept and the slope in the case of the scaled sum contrast. It is possible to write these statements as formal parameter estimates. In scaled sum contrasts, the slope parameter \(\beta_1\) assesses the following quantity:

\[\begin{equation} \beta_1 = -1 \cdot \mu_{F1} + 1 \cdot \mu_{F2} \end{equation}\]

This estimates the same quantity as the slope in the treatment contrast. The intercept now assesses a different quantity: the average of the two conditions \(F1\) and \(F2\):

\[\begin{equation} \alpha = 1/2 \cdot \mu_{F1} + 1/2 \cdot \mu_{F2} = \frac{\mu_{F1} + \mu_{F2}}{2} \end{equation}\]

In balanced data, i.e., in data sets where there are no missing data points, the average of the two conditions \(F1\) and \(F2\) is the grand mean. In unbalanced data sets, where there are missing values, this average is the weighted grand mean. To illustrate this point, consider an example with fully balanced data and two equal group sizes of \(5\) subjects for each group \(F1\) and \(F2\). Here, the grand mean is also the mean across all subjects. Next, consider a highly simplified unbalanced data set, where in condition \(F1\) two observations of the dependent variable are available with values of \(2\) and \(3\), and where in condition \(F2\) only one observation of the dependent variable is available with a value of \(4\). In this data set, the mean across all subjects is \(\frac{2 + 3 + 4}{3} = \frac{9}{3} = 3\). However, the (weighted) grand mean as assessed in the intercept in a model using sum contrasts for factor \(F\) would first compute the mean for each group separately (i.e., \(\frac{2 + 3}{2} = 2.5\), and \(4\)), and then compute the mean across conditions \(\frac{2.5 + 4}{2} = \frac{6.5}{2} = 3.25\). The grand mean of \(3.25\) is different from the mean across subjects of \(3\).

To summarize, treatment contrasts and sum contrasts are two possible ways to parameterize the difference between two groups; they generally estimate different quantities. Treatment contrasts compare one or more means against a baseline condition, whereas sum contrasts compare a condition’s mean to the grand mean (which in the two-group case also implies estimating the difference between the two group means). One question that comes up here is: how does one know or how can one formally derive what quantities are estimated by a given set of contrasts? (In the context of Bayes factors, the question would be: what hypothesis test does the contrast coding encode?) This question will be discussed in detail below for the general case of any arbitrary contrast coding.

8.1.4 Cell means parameterization and posterior comparisons

One alternative option is to use what is called the cell means parameterization (this coding is also called “one-hot encoding” in the context of machine learning). In this approach, one does not estimate an intercept term, and then differences between factor levels. Instead, each free parameter is used to simply estimate the mean of one of the factor levels. As a consequence, no comparisons between condition means are estimated, but simply the mean of each experimental condition is estimated. Cell means parameterization is specified by explicitly removing the intercept term (which is added automatically in brms) by adding a \(-1\) in the regression formula:

fit_mCM <- brm(DV ~ -1 + F,
               data = df_contrasts1,
               family = gaussian(),
               prior = c(prior(normal(0, 2), class = sigma),
                         prior(normal(0, 2), class = b)))
fixef(fit_mCM)
##     Estimate Est.Error Q2.5 Q97.5
## FF1      0.8      0.11 0.58  1.03
## FF2      0.4      0.11 0.18  0.63

Now, the regression coefficients (see the column labeled Estimate) estimate the mean of the first factor level (\(0.8\)) and the mean of the second factor level (\(0.4\)). This cell means parameterization usually does not allow us to make inferences about the hypotheses of interest using Bayes factors, as these hypotheses usually relate to differences between conditions rather than to whether each condition differs from zero.

The cell means parameterization provides a good example demonstrating an advantage of Bayesian data analysis: In Bayesian models, it is possible to use the posterior samples to compute new estimates that were not directly contained in the fitted model. To implement this, we first extract the posterior samples from the brm() model object:

df_postSamp <- as_draws_df(fit_mCM)

In a second step, we can then compute comparisons from these posterior samples. For example, we can compute the difference between conditions \(F2\) and \(F1\). To do so, we simply take the posterior samples for each condition, and compute their difference.

df_postSamp  <- df_postSamp %>%
  mutate(b_dif = b_FF2 - b_FF1)

This provides a posterior sample of the difference between conditions. It is possible to investigate this posterior sample by looking at its mean and 95% credible interval:

c(Estimate = mean(df_postSamp$b_dif),
  quantile(df_postSamp$b_dif, p = c(0.025, 0.975)))
## Estimate     2.5%    97.5% 
##   -0.399   -0.709   -0.101

The above summary provides the same estimate (roughly \(-0.4\)) that we obtained previously when using the treatment contrast or the scaled sum contrast. (Note that if the priors are sufficiently non-informative, the results will be similar. However, given that the priors differ in comparison to a model estimating the difference directly, the posterior may also differ between models.)

Thus, Bayesian models provide a lot of flexibility in computing new comparisons post-hoc from the posterior samples and in obtaining their posterior distributions. However, what these posterior computations do not provide directly are inferences on null hypotheses. That is, just by looking at the credible intervals, we cannot make inferences about whether a null hypothesis can be rejected; an explicit hypothesis test is needed to answer such a question (see chapter 15).

8.2 The hypothesis matrix illustrated with a three-level factor

Consider an example with the three word classes nouns, verbs, and adjectives. We load simulated data from a lexical decision task with response times as dependent variable. The research question is: do response times differ as a function of the between-subject factor word class with three levels: nouns, verbs, and adjectives? Here, just to illustrate the case of a three-level factor, we make the arbitrary assumption that nouns have longest response times and adjectives the shortest response times. Word class is specified as a between-subject factor. In cognitive science experiments, word class will usually vary within subjects and between items. Because the within- or between-subjects status of an effect is independent of its contrast coding, we assume the manipulation to be between subjects for ease of exposition. The concepts presented here extend to repeated measures designs that are often analyzed using hierarchical Bayesian (linear mixed) models.

Load and display the simulated data.

data("df_contrasts2")
head(df_contrasts2)
## # A tibble: 6 × 3
##   F        DV    id
##   <fct> <int> <int>
## 1 nouns   476     1
## 2 nouns   517     2
## 3 nouns   491     3
## # ℹ 3 more rows
TABLE 8.2: Summary statistics per condition for the simulated data.
Factor N data Est. means Std. dev. Std. errors
adjectives \(4\) \(400.2\) \(19.9\) \(9.9\)
nouns \(4\) \(500.0\) \(20.0\) \(10.0\)
verbs \(4\) \(450.2\) \(20.0\) \(10.0\)

As shown in Table 8.2, the estimated means reflect our assumptions about the true means in the data simulation: Response times are longest for nouns and shortest for adjectives. In the following sections, we use this data set to illustrate sum contrasts. Furthermore, we will use an additional data set to illustrate repeated, Helmert, polynomial, and custom contrasts. In practice, usually only one set of contrasts is selected when the expected pattern of means is formulated during the design of the experiment.

8.2.1 Sum contrasts

We begin with sum contrasts. Suppose that the expectation is that nouns are responded to slower and adjectives are responded to faster than the grand mean response time. Then, the research question could be: By how much do nouns differ from the grand mean and by how much do adjectives differ from the grand mean? And are the responses slower or faster than the grand mean? We want to estimate the following two quantities:

\[\begin{equation} \beta_1 = \mu_1 - \frac{\mu_1+\mu_2+\mu_3}{3} = \mu_1 - GM \end{equation}\]

and

\[\begin{equation} \beta_2 = \mu_2 - \frac{\mu_1+\mu_2+\mu_3}{3} = \mu_2 - GM \end{equation}\]

\(\beta_1\) can also be written as:

\[\begin{equation} \label{h01} \beta_1 = \frac{2}{3} \mu_1 - \frac{1}{3}\mu_2 - \frac{1}{3}\mu_3 \end{equation}\]

Here, the weights \(2/3, -1/3, -1/3\) are informative about how to combine the condition means to estimate the linear model coefficient.

\(\beta_2\) is also rewritten as:

\[\begin{align}\label{h02} \beta_2 =& \mu_2 - \frac{\mu_1+\mu_2+\mu_3}{3} \\ \Leftrightarrow \beta_2 =& -\frac{1}{3}\mu_1 + \frac{2}{3} \mu_2 - \frac{1}{3} \mu_3 \end{align}\]

Here, the weights are \(-1/3, 2/3, -1/3\), and they again indicate how to combine the condition means for estimating the regression coefficient.

8.2.2 The hypothesis matrix

The weights of the condition means are not only useful for defining parameter estimates. They also provide the starting step in a very powerful method which allows the researcher to generate the contrasts that are needed to estimate these comparisons in a linear model. That is, what we did so far is to explain some kinds of different contrast codings that exist and what comparisons are estimated by these contrasts. That is, if a certain data set is given and the goal is to estimate certain comparisons, then the procedure would be to check whether any of the contrasts that we encountered estimate these comparisons of interest. Sometimes it suffices to use one of these existing contrasts. At other times, our research questions may not correspond exactly to any of the contrasts in the default set of standard contrasts provided in R. For these cases, or for more complex designs, it is very useful to know how contrast matrices are created. Indeed, a relatively simple procedure exists in which we write our comparisons formally, extract the weights of the condition means from the comparisons, and then automatically generate the correct contrast matrix that we need in order to estimate these comparisons in a linear model. Using this powerful method, it is not necessary to find a match to a contrast matrix provided by the family of functions in R starting with the prefix contr. Instead, it is possible to simply define the comparisons that one wants to estimate, and to obtain the correct contrast matrix for these in an automatic procedure. Here, for pedagogical reasons, we show some examples of how to apply this procedure in cases where the comparisons do correspond to some of the existing contrasts.

Defining a custom contrast matrix involves four steps:

  1. Write down the estimated comparisons
  2. Extract the weights and write them into what we will call a hypothesis matrix (and can also be viewed as a comparison matrix)
  3. Apply the generalized matrix inverse to the hypothesis matrix to create the contrast matrix
  4. Assign the contrast matrix to the factor and run the (Bayesian) model

The term hypothesis matrix is used here because contrast coding is often done to carry out an explicit hypothesis test; but one could of course use contrast coding just to compute the estimates of an effect and their uncertainty, without doing a hypothesis test.

Let us apply this four-step procedure to our example of the sum contrast. The first step, writing down the estimated comparisons, as done in section 8.2.1. The second step involves writing down the weights that each comparison gives to condition means. The weights for the first comparison are wH01=c(+2/3, -1/3, -1/3), and the weights for the second comparison are wH02=c(-1/3, +2/3, -1/3).

Before writing these into a hypothesis matrix, we also define the estimated quantity for the intercept term. The intercept parameter estimates the mean across all conditions:

\[\begin{align} \alpha = \frac{\mu_1 + \mu_2 + \mu_3}{3} \\ \alpha = \frac{1}{3} \mu_1 + \frac{1}{3}\mu_2 + \frac{1}{3}\mu_3 \end{align}\]

This estimate has weights of \(1/3\) for all condition means. The weights from all three model parameters that were defined are now combined and written into a matrix that we refer to as the hypothesis matrix (Hc):

HcSum <-
  rbind(cH00 = c(adjectives = 1 / 3, nouns = 1 / 3, verbs = 1 / 3),
        cH01 = c(adjectives = 2 / 3, nouns = -1 / 3, verbs = -1 / 3),
        cH02 = c(adjectives = -1 / 3, nouns = 2 / 3, verbs = -1 / 3))
fractions(t(HcSum))
##            cH00 cH01 cH02
## adjectives  1/3  2/3 -1/3
## nouns       1/3 -1/3  2/3
## verbs       1/3 -1/3 -1/3

Each set of weights is first entered as a row into the matrix (command rbind()) (see Schad et al. 2020). We switch rows and columns of the matrix for easier readability using the command t() (this transposes the matrix). The command fractions() from MASS library turns the decimals into fractions to improve readability.

Now that the condition weights have been written into the hypothesis matrix, the third step of the procedure is implemented: a matrix operation called the generalized matrix inverse33 is used to obtain the contrast matrix that is needed to estimate these comparisons in a linear model.

Use the function ginv2() from the bcogsci package for the next step. It is similar to ginv() from MASS but provides nicer formatting of the output.

Applying the generalized inverse to the hypothesis matrix results in the new matrix XcSum. This is the contrast matrix \(X_c\) that estimates exactly those comparisons that were specified earlier:

(XcSum <- ginv2(HcSum))
##            cH00 cH01 cH02
## adjectives  1    1    0  
## nouns       1    0    1  
## verbs       1   -1   -1

This contrast matrix corresponds exactly to the sum contrasts described above. In the case of the sum contrast, the contrast matrix looks very different from the hypothesis matrix. The contrast matrix in sum contrasts codes with \(+1\) the condition that is to be compared to the grand mean. The condition that is never compared to the grand mean is coded as \(-1\). Without knowing the relationship between the hypothesis matrix and the contrast matrix, the meaning of the coefficients is completely opaque.

To verify this custom-made contrast matrix, it is compared to the sum contrast matrix as generated by the R function contr.sum() in the stats package. The resulting contrast matrix is identical to the result when adding the intercept term, a column of ones, to the contrast matrix:

fractions(cbind(1, contr.sum(3)))
##   [,1] [,2] [,3]
## 1  1    1    0  
## 2  1    0    1  
## 3  1   -1   -1

In order to estimate model parameters, step four in our procedure involves assigning sum contrasts to the factor \(F\) in our example data, and fitting a (Bayesian) linear model. This allows us to estimate the regression coefficients associated with each contrast. We compare these to the data shown above (Table 8.2) to test whether the regression coefficients actually correspond to the differences of condition means, as intended. To define the contrast, it is necessary to remove the intercept term, as this is automatically added by the modeling function brm().

contrasts(df_contrasts2$F) <- XcSum[, 2:3]
fit_Sum <- brm(DV ~ F,
               data = df_contrasts2,
               family = gaussian(),
               prior = c(prior(normal(500, 100), class = Intercept),
                         prior(normal(0, 100), class = sigma),
                         prior(normal(0, 100), class = b)))
fixef(fit_Sum)
##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    450.2      6.74 436.7 463.7
## FcH01        -49.6      9.82 -69.4 -29.4
## FcH02         49.6      9.95  29.8  69.4

The (Bayesian) linear model regression coefficients show the grand mean response time of \(450\) ms in the intercept. Remember that the first regression coefficient FcH01 was designed to estimate the extent to which adjectives are responded to faster than the grand mean. The regression coefficient FcH01 (Estimate) of \(-50\) reflects the difference between adjectives (\(400\) ms) and the grand mean of \(450\) ms. The second estimate of interest tells us the extent to which response times for nouns differ from the grand mean. The fact that the second regression coefficient FcH02 is close to \(50\) indicates that response times for nouns are (\(500\) ms) slower than the grand mean of \(450\) ms. Although the nouns are estimated to have \(50\) ms longer reading times than the grand mean, the reading times for adjectives are \(50\) ms faster than the grand mean.

We have now not only derived contrasts, parameter estimates, and comparisons for the sum contrast, we have also used a powerful and highly general procedure that is used to generate contrasts for many kinds of different comparisons and experimental designs.

8.2.3 Generating contrasts: The hypr package

To work with the four-step procedure, i.e., to flexibly design contrasts to estimate specific comparisons, we have developed the R package hypr (Rabe et al. 2020a). This package allows the researcher to specify the desired comparisons, and based on these comparisons, it automatically generates contrast matrices that allow us to estimate these comparisons in linear models. The functions available in this package thus considerably simplify the implementation of the four-step procedure outlined above. Because hypr was originally written with the frequentist framework in mind, the comparisons are expressed as null hypotheses. In the Bayesian framework, these should be treated as comparisons between (bundles of) condition means. To illustrate the functionality of the hypr package, we will use the two comparisons that we had defined and analyzed in the previous section:

\[\begin{equation} \beta_1 = \mu_1 - \frac{\mu_1+\mu_2+\mu_3}{3} = \mu_1 - GM \end{equation}\]

and

\[\begin{equation} \beta_2 = \mu_2 - \frac{\mu_1+\mu_2+\mu_3}{3} = \mu_2 - GM \end{equation}\]

These estimates are effectively comparisons between condition means or between bundles of condition means. That is, both \(\mu_1\) and \(\mu_2\) are compared to the grand mean. These two comparisons can be directly entered into R using the hypr() function from the hypr package. To do so, we use some labels to indicate factor levels. E.g., adjectives, nouns, and verbs can represent factor levels \(\mu_1\), \(\mu_2\), and \(\mu_3\). The first comparison specifies that \(\mu_1\) is compared to \(\frac{\mu_1+\mu_2+\mu_3}{3}\). This can be written as a formula in R: adjectives ~ (adjectives + nouns + verbs)/3. The second comparison is that \(\mu_2\) is compared to \(\frac{\mu_1+\mu_2+\mu_3}{3}\), which can be written as nouns ~ (adjectives + nouns + verbs)/3.

HcSum <- hypr(b1 = adjectives ~ (adjectives + nouns + verbs) / 3,
              b2 = nouns ~ (adjectives + nouns + verbs) / 3,
              levels = c("adjectives", "nouns", "verbs"))
HcSum
## hypr object containing 2 null hypotheses:
## H0.b1: 0 = (2*adjectives - nouns - verbs)/3
## H0.b2: 0 = (2*nouns - adjectives - verbs)/3
## 
## Call:
## hypr(b1 = ~2/3 * adjectives - 1/3 * nouns - 1/3 * verbs, b2 = ~2/3 * 
##     nouns - 1/3 * adjectives - 1/3 * verbs, levels = c("adjectives", 
## "nouns", "verbs"))
## 
## Hypothesis matrix (transposed):
##            b1   b2  
## adjectives  2/3 -1/3
## nouns      -1/3  2/3
## verbs      -1/3 -1/3
## 
## Contrast matrix:
##            b1 b2
## adjectives  1  0
## nouns       0  1
## verbs      -1 -1

The results show that the comparisons between condition means have been re-written into a form where \(0\) is coded on the left side of the equation, and the condition means together with associated weights are written on the right side of the equation. This presentation makes it easy to see the weights of the condition means to code a certain comparison. The next part of the results shows the hypothesis matrix, which contains the weights from the condition means. Thus, hypr takes formulas coding comparisons between condition means as input, and automatically extracts the corresponding weights and encodes them into the hypothesis matrix. hypr moreover applies the generalized matrix inverse to obtain the contrast matrix from the hypothesis matrix. The different steps correspond exactly to the steps we had carried out manually in the preceding section. hypr automatically performs these steps for us. We can now extract the contrast matrix by a simple function call:

contr.hypothesis(HcSum)
##            b1 b2
## adjectives  1  0
## nouns       0  1
## verbs      -1 -1
## attr(,"class")
## [1] "hypr_cmat" "matrix"    "array"

We can assign this contrast to our factor as we did before.

contrasts(df_contrasts2$F) <- contr.hypothesis(HcSum)

Now, we could again run the same model. However, since the contrast matrix is now the same as used before, the results would also be exactly the same, and we therefore skip the model fitting for brevity.

The hypr package can be used to create contrasts for Bayesian models, where the focus lies on estimation of contrasts that code comparisons between condition means or bundles of condition means. (Of course, one can use contrast coding for carrying out hypothesis tests using the Bayes factor; see chapter 15; in this case, the priors matter a lot for the results.) Thus, the comparison that one specifies implies the estimation of a difference between condition means or bundles of condition means. We see this in the output of the hypr() function (see the first section of the results); these formulate the comparison in a way that also illustrates the estimation of model parameters. That is, the comparison (expressed in the hypr package’s syntax) \(\mu_1 \sim \frac{\mu_1+\mu_2+\mu_3}{3}\) corresponds to a parameter estimate of b1 = 2/3*m1 - 1/3*m2 - 1/3*m3, where \(m1\) to \(m3\) are the means for each of the conditions. The resulting contrasts will then allow us to estimate the specified differences between condition means or bundles of condition means.

8.3 Other types of contrasts: illustration with a factor of four levels

Here, we introduce repeated difference, Helmert, and polynomial contrasts.34 For these, it may be instructive to consider an experiment with one between-subject factor with four levels. We load a corresponding data set, which contains simulated data about response times with a four-level between-subject factor. The sample sizes for each level and the means and standard errors are shown in Table 8.3, and the means and standard errors are also shown graphically in Figure 8.2.

data("df_contrasts3")
The means and error bars (showing standard errors) for a simulated data set with one between-subjects factor with four levels.

FIGURE 8.2: The means and error bars (showing standard errors) for a simulated data set with one between-subjects factor with four levels.

TABLE 8.3: Summary statistics per condition for the simulated data.
Factor N data Est. means Std. dev. Std. errors
F1 \(5\) \(10.0\) \(10.0\) \(4.5\)
F2 \(5\) \(20.0\) \(10.0\) \(4.5\)
F3 \(5\) \(10.0\) \(10.0\) \(4.5\)
F4 \(5\) \(40.0\) \(10.0\) \(4.5\)

We assume that the four factor levels \(F1\) to \(F4\) reflect levels of word frequency, including the levels low, medium-low, medium-high, and high frequency words, and that the dependent variable reflects response time.

Qualitatively, the simulated pattern of results is similar to empirically observed values for word frequency effects on single fixation durations in eye tracking (Heister, Würzner, and Kliegl 2012).

8.3.1 Repeated contrasts

Arguably, the most popular contrast psychologists and psycholinguists are interested in is the comparison between neighboring levels of a factor. This type of contrast is called the repeated contrast. In our example, our research question might be whether the frequency level leads to slower response times than frequency level , whether frequency level leads to slower response times than frequency level , and whether frequency level leads to slower response times than frequency level .

Repeated contrasts are used to implement these comparisons. Consider first how to derive the contrast matrix for repeated contrasts, starting out by specifying the comparisons that are to be estimated. Importantly, this again applies the general strategy of how to translate (any) comparisons between groups or conditions into a set of contrasts, yielding a powerful tool of great value in many research settings. We follow the four-step procedure outlined above.

The first step is to specify our comparisons, and to write them down in a way such that their weights can be extracted easily. For a four-level factor, the three comparisons are:

\[\begin{equation} \beta_{2-1} = -1 \cdot \mu_1 + 1 \cdot \mu_2 + 0 \cdot \mu_3 + 0 \cdot \mu_4 \end{equation}\]

\[\begin{equation} \beta_{3-2} = 0 \cdot \mu_1 - 1 \cdot \mu_2 + 1 \cdot \mu_3 + 0 \cdot \mu_4 \end{equation}\]

\[\begin{equation} \beta_{4-3} = 0 \cdot \mu_1 + 0 \cdot \mu_2 - 1 \cdot \mu_3 + 1 \cdot \mu_4 \end{equation}\]

Here, the \(\mu_x\) are the mean response times in condition \(x\). Each regression coefficient gives weights to the different condition means. For example, the first estimate (\(\beta_{2-1}\)) estimates the difference between condition mean for \(F2\) (\(\mu_2\)) minus the condition mean for \(F1\) (\(\mu_1\)), but ignores condition means for \(F3\) and \(F4\) (\(\mu_3\), \(\mu_4\)). \(\mu_1\) has a weight of \(-1\), \(\mu_2\) has a weight of \(+1\), and \(\mu_3\) and \(\mu_4\) have weights of \(0\).

We can write these comparisons into hypr:

HcRep <- hypr(c2vs1 = F2 ~ F1,
              c3vs2 = F3 ~ F2,
              c4vs3 = F4 ~ F3,
              levels = c("F1", "F2", "F3", "F4"))
HcRep
## hypr object containing 3 null hypotheses:
## H0.c2vs1: 0 = F2 - F1
## H0.c3vs2: 0 = F3 - F2
## H0.c4vs3: 0 = F4 - F3
## 
## Call:
## hypr(c2vs1 = ~F2 - F1, c3vs2 = ~F3 - F2, c4vs3 = ~F4 - F3, levels = c("F1", 
## "F2", "F3", "F4"))
## 
## Hypothesis matrix (transposed):
##    c2vs1 c3vs2 c4vs3
## F1 -1     0     0   
## F2  1    -1     0   
## F3  0     1    -1   
## F4  0     0     1   
## 
## Contrast matrix:
##    c2vs1 c3vs2 c4vs3
## F1 -3/4  -1/2  -1/4 
## F2  1/4  -1/2  -1/4 
## F3  1/4   1/2  -1/4 
## F4  1/4   1/2   3/4

The hypothesis matrix shows exactly the weights that we had written down above. Moreover, we see the contrast matrix. In the case of the repeated contrast, the contrast matrix again looks very different from the hypothesis matrix. In this case, the contrast matrix looks a lot less intuitive than the hypothesis matrix, and if one did not know the associated hypothesis matrix, it seems unclear what the contrast matrix would actually test. To verify this custom-made contrast matrix, we compare it to the repeated contrast matrix as generated by the R function contr.sdif() in the package (Ripley 2023). The resulting contrast matrix is identical to our result:

fractions(contr.sdif(4))
##   2-1  3-2  4-3 
## 1 -3/4 -1/2 -1/4
## 2  1/4 -1/2 -1/4
## 3  1/4  1/2 -1/4
## 4  1/4  1/2  3/4

We can thus use either approach (hypr() or contr.sdif()) to obtain the contrast matrix in this case. Next, we apply the repeated contrasts to the factor \(F\) in the example data and run a linear model. This allows us to estimate the regression coefficients associated with each contrast. These are compared to the data in Figure 8.2 to test whether the regression coefficients actually correspond to the differences between successive condition means, as intended.

contrasts(df_contrasts3$F) <- contr.hypothesis(HcRep)
fit_Rep <- brm(DV ~ F,
               data = df_contrasts3,
               family = gaussian(),
               prior = c(prior(normal(20, 50), class = Intercept),
                         prior(normal(0, 50), class = sigma),
                         prior(normal(0, 50), class = b)))
fixef(fit_Rep)
##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.99      2.58  15.00 25.10
## Fc2vs1        9.57      6.94  -4.13 23.19
## Fc3vs2       -9.20      6.94 -22.95  4.23
## Fc4vs3       29.09      6.98  15.22 43.09

The results show that as expected, the regression coefficients reflect the differences that were of interest: the regression coefficient (Estimate) Fc2vs1 has a value of approximately \(10\), which corresponds to the difference between the condition mean for \(F2\) (\(20\)) minus the condition mean for \(F1\) (\(10\)), i.e., \(20 - 10 = 10\). Likewise, the regression coefficient Fc3vs2 has a value of approximately \(-10\), which corresponds to the difference between the condition mean for \(F3\) (\(10\)) minus the condition mean for \(F2\) (\(20\)), i.e., \(10 - 20 = -10\). Finally, the regression coefficient Fc4vs3 has a value of roughly \(30\), which reflects the difference between condition \(F4\) (\(40\)) minus condition \(F3\) (\(10\)), i.e., \(40 - 10 = 30\). Thus, the regression coefficients estimate differences between successive or neighboring condition means.

8.3.2 Helmert contrasts

Another common contrast is the Helmert contrast. In a Helmert contrast for our four-level factor, the first contrast compares level \(F1\) versus \(F2\). The second contrast compares level \(F3\) to the average of the first two, i.e., F3 ~ (F1+F2)/2. The third contrast then compares level \(F4\) to the average of the first three. We can code this contrast in hypr:

HcHel <- hypr(b1 = F2 ~ F1,
              b2 = F3 ~ (F1 + F2) / 2,
              b3 = F4 ~ (F1 + F2 + F3) / 3,
              levels = c("F1", "F2", "F3", "F4"))
HcHel
## hypr object containing 3 null hypotheses:
## H0.b1: 0 = F2 - F1
## H0.b2: 0 = F3 - 1/2*F1 - 1/2*F2
## H0.b3: 0 = F4 - 1/3*F1 - 1/3*F2 - 1/3*F3
## 
## Call:
## hypr(b1 = ~F2 - F1, b2 = ~F3 - 1/2 * F1 - 1/2 * F2, b3 = ~F4 - 
##     1/3 * F1 - 1/3 * F2 - 1/3 * F3, levels = c("F1", "F2", "F3", 
## "F4"))
## 
## Hypothesis matrix (transposed):
##    b1   b2   b3  
## F1   -1 -1/2 -1/3
## F2    1 -1/2 -1/3
## F3    0    1 -1/3
## F4    0    0    1
## 
## Contrast matrix:
##    b1   b2   b3  
## F1 -1/2 -1/3 -1/4
## F2  1/2 -1/3 -1/4
## F3    0  2/3 -1/4
## F4    0    0  3/4

The classical Helmert contrast coded by the function contr.helmert() yields a similar but slightly different result:

contr.helmert(4)
##   [,1] [,2] [,3]
## 1   -1   -1   -1
## 2    1   -1   -1
## 3    0    2   -1
## 4    0    0    3

These contrasts are scaled versions of our custom Helmert contrast. I.e., the first column of our custom Helmert contrast has to be multiplied by 2 to get the classical version, the second column has to be multiplied by 3, and the fourth column has to be multiplied by 4 to get to our custom Helmert contrast. Therefore, we suggest that our custom Helmert contrast defined using the hypr function is more appropriate and intuitive to use. Probably the only reason the classical Helmert contrast uses these scaled differences is that the rescaling yields an easier contrast matrix, which consists of integers rather than fractions. The intuitive estimates from our custom Helmert contrast seem much more relevant in Bayesian approaches today. This is because in the custom Helmert contrast, the coefficients estimate the differences between conditions (rather than scaled versions of this). That means that we can set the priors intuitively (as priors on differences between conditions - rather than on scaled differences between conditions) and that we can accordingly interpret the posterior much more straightforwardly. We apply the custom Helmert contrast here:

contrasts(df_contrasts3$F) <- contr.hypothesis(HcHel)
fit_Hel <- brm(DV ~ F,
               data = df_contrasts3,
               family = gaussian(),
               prior = c(prior(normal(20, 50), class = Intercept),
                         prior(normal(0, 50), class = sigma),
                         prior(normal(0, 50), class = b)))
fixef(fit_Hel)
##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.97      2.37  15.12 24.64
## Fb1           9.84      6.84  -3.93 23.47
## Fb2          -5.06      6.04 -16.96  6.96
## Fb3          26.40      5.53  15.31 37.52

When we fit the Bayesian model using our custom Helmert contrast, we can see that the estimates reflect the comparisons outlined above. The first estimate Fb1 has a value of roughly \(10\), reflecting the difference between conditions \(F1\) and \(F2\). The second estimate Fb2 has a value of \(5\), which reflects the difference between condition \(F3\) (\(10\)) and the average of the first two conditions (\((10+20)/2=15\)). The estimate Fb3 reflects the difference between \(F4\) (\(40\)) minus the average of the first three, which is \((10+20+10)/3=13.3\), and is thus \(40-13.3=26.7\).

Box 8.1 Treatment contrast with intercept as the grand mean.

Above, we have introduced the treatment contrast, where each contrast compares one condition to a baseline condition. We have discussed that the intercept in the treatment contrast estimates the condition mean for the baseline condition. There are some applications where this behavior may seem sub-optimal. This can be the case in experimental designs with multiple factors, where we may want to use centered contrasts (this is discussed below). Moreover, the contrast coding of the population-level (or fixed) effects also defines what the group-level (or random) effects assess. If the intercept assesses the grand mean–rather than the baseline condition–in hierarchical models, then the group-level intercepts reflect the grand mean variance, rather than the variance in the baseline condition.

It is possible to design a treatment contrast where the intercept reflects the grand mean (assuming a balanced design; otherwise it’s the unweighted grand mean). We implement this using the hypr package. The trick is to add the intercept explicitly as a comparison of the average of all four condition means:

HcTrGM <- hypr(b0 = ~ (F1 + F2 + F3 + F4) / 4,
               b1 = F2 ~ F1,
               b2 = F3 ~ F1,
               b3 = F4 ~ F1,
               levels = c("F1", "F2", "F3", "F4"))
HcTrGM
## hypr object containing 4 null hypotheses:
## H0.b0: 0 = (F1 + F2 + F3 + F4)/4  (Intercept)
## H0.b1: 0 = F2 - F1
## H0.b2: 0 = F3 - F1
## H0.b3: 0 = F4 - F1
## 
## Call:
## hypr(b0 = ~1/4 * F1 + 1/4 * F2 + 1/4 * F3 + 1/4 * F4, b1 = ~F2 - 
##     F1, b2 = ~F3 - F1, b3 = ~F4 - F1, levels = c("F1", "F2", 
## "F3", "F4"))
## 
## Hypothesis matrix (transposed):
##    b0  b1  b2  b3 
## F1 1/4  -1  -1  -1
## F2 1/4   1   0   0
## F3 1/4   0   1   0
## F4 1/4   0   0   1
## 
## Contrast matrix:
##    b0   b1   b2   b3  
## F1    1 -1/4 -1/4 -1/4
## F2    1  3/4 -1/4 -1/4
## F3    1 -1/4  3/4 -1/4
## F4    1 -1/4 -1/4  3/4

The hypothesis matrix now explicitly codes the intercept as the first column, where all hypothesis weights are equal and sum up to one. This is coding the intercept. The other hypothesis weights are as expected for the treatment contrast. The contrast matrix now looks very different compared to the standard treatment contrast. Next, we fit a model with this adapted treatment contrast. The function contr.hypothesis automatically removes the intercept that is encoded in HcTrGM, since this is automatically added by brms.

contrasts(df_contrasts3$F) <- contr.hypothesis(HcTrGM)
fit_TrGM <- brm(DV ~ F,
                data = df_contrasts3,
                family = gaussian(),
                prior = c(prior(normal(20, 50), class = Intercept),
                          prior(normal(0, 50), class = sigma),
                          prior(normal(0, 50), class = b)))
fixef(fit_TrGM)
##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.99      2.54  14.99  24.8
## Fb1           9.37      6.98  -4.37  23.2
## Fb2          -0.51      7.02 -14.30  13.5
## Fb3          29.24      6.93  15.09  42.8

The results show that the coefficients reflect comparisons of each condition \(F2\), F3, and F4 to the baseline condition \(F1\). The intercept now captures the grand mean across all four conditions of \(20\).

8.3.3 Contrasts in linear regression analysis: The design or model matrix

We have discussed how different contrasts are created from the hypothesis matrix. What we have not treated in detail is how exactly contrasts are used in a linear model. Here, we will see that the contrasts for a factor in a linear model are just the same thing as continuous numeric predictors (i.e., covariates) in a linear/multiple regression analysis. That is, contrasts are the way to encode discrete factor levels into numeric predictor variables to use in linear/multiple regression analysis, by encoding which differences between factor levels are estimated. The contrast matrix \(X_c\) that we have looked at so far has one entry (row) for each experimental condition. For use in a linear model, the contrast matrix is coded into a design or model matrix \(X\), where each individual data point has one row. The design matrix \(X\) can be extracted using the function model.matrix():

# contrast matrix:
(contrasts(df_contrasts3$F) <- contr.hypothesis(HcRep))
##    c2vs1 c3vs2 c4vs3
## F1 -0.75  -0.5 -0.25
## F2  0.25  -0.5 -0.25
## F3  0.25   0.5 -0.25
## F4  0.25   0.5  0.75
## attr(,"class")
## [1] "hypr_cmat" "matrix"    "array"
# design/model matrix:
covars <- model.matrix(~ 1 + F, df_contrasts3)
(covars <- as.data.frame(covars))
##    (Intercept) Fc2vs1 Fc3vs2 Fc4vs3
## 1            1  -0.75   -0.5  -0.25
## 2            1  -0.75   -0.5  -0.25
## 3            1  -0.75   -0.5  -0.25
## 4            1  -0.75   -0.5  -0.25
## 5            1  -0.75   -0.5  -0.25
## 6            1   0.25   -0.5  -0.25
## 7            1   0.25   -0.5  -0.25
## 8            1   0.25   -0.5  -0.25
## 9            1   0.25   -0.5  -0.25
## 10           1   0.25   -0.5  -0.25
## 11           1   0.25    0.5  -0.25
## 12           1   0.25    0.5  -0.25
## 13           1   0.25    0.5  -0.25
## 14           1   0.25    0.5  -0.25
## 15           1   0.25    0.5  -0.25
## 16           1   0.25    0.5   0.75
## 17           1   0.25    0.5   0.75
## 18           1   0.25    0.5   0.75
## 19           1   0.25    0.5   0.75
## 20           1   0.25    0.5   0.75

For each of the \(20\) subjects, four numbers are stored in this model matrix. They represent the three values of three predictor variables used to predict response times in the task. Indeed, this matrix is exactly the design matrix \(X\) commonly used in multiple regression analysis, where each column represents one numeric predictor variable (covariate), and the first column codes the intercept term.

To further illustrate this, the covariates are extracted from this design matrix and stored separately as numeric predictor variables in the data frame:

df_contrasts3[, c("Fc2vs1", "Fc3vs2", "Fc4vs3")] <- covars[, 2:4]

They are now used as numeric predictor variables in a multiple regression analysis:

fit_m3 <- brm(DV ~ Fc2vs1 + Fc3vs2 + Fc4vs3,
              data = df_contrasts3,
              family = gaussian(),
              prior = c(prior(normal(20, 50), class = Intercept),
                        prior(normal(0, 50), class = sigma),
                        prior(normal(0, 50), class = b)))
fixef(fit_m3)
##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.99      2.50  14.92 24.86
## Fc2vs1        9.67      6.92  -4.26 23.16
## Fc3vs2       -9.46      6.86 -22.59  4.55
## Fc4vs3       29.36      6.99  14.82 43.02

The results show that the regression coefficients are the same as in the contrast-based analysis shown in the previous section (on repeated contrasts). This demonstrates that contrasts serve to code discrete factor levels into a linear/multiple regression analysis by numerically encoding comparisons between specific condition means.

8.3.4 Polynomial contrasts

Polynomial contrasts are another option for analyzing factors. Suppose that we expect a linear trend across conditions, where the response increases by a constant magnitude with each successive factor level. This could be the expectation when four levels of a factor reflect decreasing levels of word frequency (i.e., four factor levels: high, medium-high, medium-low, and low word frequency), where one expects the fastest response for high frequency words, and successively slower responses for lower word frequencies. The effect for each individual level of a factor (e.g., as coded via a repeated contrast) may not be strong enough for detecting it in the statistical model. Specifying a linear trend in a polynomial contrast (see effect F.L below) allows us to pool the whole increase (across all four factor levels) into a single coefficient for the linear trend, increasing statistical sensitivity for estimating/detecting the increase. Such a specification constrains the estimate to one interpretable parameter, e.g., a linear increase across factor levels. The larger the number of factor levels, the more parsimonious are polynomial contrasts compared to contrast-based specifications as introduced in the previous sections. Going beyond a linear trend, one may also have expectations about quadratic trends (see the estimate for F.Q below). For example, one may expect an increase only among very low frequency words, but no difference between high and medium-high frequency words.

Here is an example for how to code polynomial contrasts for a four-level factor. In this case, one can estimate a linear (F.L), a quadratic (F.Q), and a cubic (F.C) trend. If more factor levels are present, then higher order trends can be estimated.

Xpol <- contr.poly(4)
(contrasts(df_contrasts3$F) <- Xpol)
##          .L   .Q     .C
## [1,] -0.671  0.5 -0.224
## [2,] -0.224 -0.5  0.671
## [3,]  0.224 -0.5 -0.671
## [4,]  0.671  0.5  0.224
fit_Pol <- brm(DV ~ F,
               data = df_contrasts3,
               family = gaussian(),
               prior = c(prior(normal(20, 50), class = Intercept),
                         prior(normal(0, 50), class = sigma),
                         prior(normal(0, 50), class = b)))
fixef(fit_Pol)
##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    20.04      2.48 14.91  24.9
## F.L          17.71      4.92  8.09  27.4
## F.Q           9.97      4.84  0.39  19.5
## F.C          13.34      5.07  3.27  23.6

In this example, condition means increase across factor levels in a linear fashion, but there may also be quadratic and cubic trends.

8.3.5 An alternative to contrasts: Monotonic effects

An alternative to specifying contrasts to estimate specific comparisons between factor levels is monotonic effects (https://paul-buerkner.github.io/brms/articles/brms_monotonic.html; Bürkner and Charpentier 2020). This simply assumes that the dependent variable increases (or decreases) in a monotonic fashion across levels of an ordered factor. In this kind of analysis, one does not define contrasts specifying differences between (groups of) factor levels. Instead, one estimates one parameter which captures the average increase (or decrease) in the dependent variable associated with two neighboring factor levels. Moreover, one estimates the percentages of the overall increase (or decrease) that is associated with each of the differences between neighboring factor levels (i.e., similar to simple difference contrasts, but measured in percentage increase, and assuming monotonicity, i.e., that the same increase or decrease is present for all simple differences).

To implement a monotonic analysis, we first code the factor \(F\) as being an ordered factor (i.e.,ordered=TRUE; remember that the default ordering is alphabetical and may need to be changed). Then, we specify that we want to estimate a monotonic effect of \(F\) using the notation mo(F) in our call to brms:

df_contrasts3$F <- factor(df_contrasts3$F, ordered = TRUE)
fit_mo <- brm(DV ~ 1 + mo(F),
              data = df_contrasts3,
              family = gaussian(),
              prior = c(prior(normal(20, 50), class = Intercept),
                        prior(normal(0, 50), class = sigma),
                        prior(normal(0, 50), class = b)))
summary(fit_mo)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: DV ~ 1 + mo(F) 
##    Data: df_contrasts3 (Number of observations: 20) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Regression Coefficients:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     9.13      4.50    -0.32    17.27 1.00     1868     2050
## moF           9.61      2.46     4.65    14.47 1.00     1921     2208
## 
## Monotonic Simplex Parameters:
##         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## moF1[1]     0.20      0.14     0.01     0.53 1.00     2422     1853
## moF1[2]     0.11      0.11     0.00     0.40 1.00     2271     2262
## moF1[3]     0.69      0.16     0.30     0.94 1.00     2448     2093
## 
## Further Distributional Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    11.82      2.29     8.28    17.15 1.00     2318     2515
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The results show that there is an overall positive population-level effect of the factor \(F\) with an estimate (moF) of \(9.61\), reflecting an average increase in the dependent variables of \(9.61\) with each level of \(F\). The model summary shows estimates for the simplex parameters, which represent the ratios of the overall increase associated with \(F\) that can be attributed to each of the differences between neighboring factor levels. The results show that most of the increase is associated with moF1[3], i.e., with the last difference, reflecting the difference between \(F3\) and \(F4\), whereas the other two differences (moF1[1], reflecting the difference between \(F1\) and \(F2\), and moF1[2], reflecting the difference between \(F2\) and \(F3\)) are smaller. Comparing conditional effects between a model using polynomial contrasts and a model assuming monotonic effects makes it clear that the current model “forces” the effects to increase in a monotonic fashion; see Figure 8.3.

ppoly <- conditional_effects(fit_Pol)
pmon  <- conditional_effects(fit_mo)
plot(ppoly, plot=FALSE)[[1]] + ggtitle("Polynomial contrasts")
plot(pmon, plot=FALSE)[[1]] + ggtitle("Monotonic effects")
Conditional effects using the polynomial contrasts on the left side vs. assuming monotonic effects on the right side.Conditional effects using the polynomial contrasts on the left side vs. assuming monotonic effects on the right side.

FIGURE 8.3: Conditional effects using the polynomial contrasts on the left side vs. assuming monotonic effects on the right side.

This is regardless of the information provided in the data; see the posterior predictive checks in Figure 8.4.


pp_check(fit_Pol, type = "violin_grouped",
         group = "F", y_draw = "points") +
  theme(legend.position = "bottom")+
  coord_cartesian(ylim = c(-55, 105)) +
  ggtitle("Polynomial contrasts")
pp_check(fit_mo, type = "violin_grouped",
         group = "F", y_draw = "points") +
  theme(legend.position = "bottom") +
  coord_cartesian(ylim = c(-55, 105)) +
  ggtitle("Monotonic effects")
Posterior predictive distributions by condition using the polynomial contrasts on the left side vs. assuming monotonic effects on the right side.Posterior predictive distributions by condition using the polynomial contrasts on the left side vs. assuming monotonic effects on the right side.

FIGURE 8.4: Posterior predictive distributions by condition using the polynomial contrasts on the left side vs. assuming monotonic effects on the right side.

The monotonicity assumption is violated in the current data set, since the mean is larger in condition \(F2\) than in condition \(F3\). The monotonic model thus assumes this (negative) difference is due to chance; see Figure 8.4.

Estimating such monotonic effects provides an alternative to the contrast coding we treat in the rest of this chapter. It may be relevant when the specific differences between factor levels are not of interest, but when instead the goal is to estimate the overall monotonic effect of a factor and when this overall effect is not well approximated by a simple linear trend.

8.4 What makes a good set of contrasts?

For a factor with \(I\) levels one can make only \(I-1\) linearly independent comparisons within a single model. For example, in a design with one factor with two levels, only one comparison is possible (between the two factor levels). The reason for this is that the intercept is also estimated. More generally, if we have a factor with \(I_1\) and another factor with \(I_2\) levels, then the total number of conditions is \(I_1\times I_2 = \nu\) (not \(I_1 + I_2\)), which implies a maximum of \(\nu-1\) contrasts.

For example, in a design with one factor with three levels, \(A\), \(B\), and \(C\), in principle one could make three comparisons (\(A\) vs. \(B\), \(A\) vs. \(C\), \(B\) vs. \(C\)). However, after defining an intercept, only two means can be compared (or two linear combinations of means). Therefore, for a factor with three levels, we define two comparisons within one statistical model.

One critical precondition for contrasts is that they implement different comparisons that are not collinear, that is, that none of the contrasts can be generated from the other contrasts by linear combination. For example, the contrast c1 = c(1,2,3) can be generated from the contrast c2 = c(3,4,5) simply by computing c2 - 2. Therefore, contrasts c1 and c2 cannot be used simultaneously. That is, each contrast needs to encode some independent information about the data.

There are (at least) three criteria to decide what a good contrast is. First, orthogonal contrasts have advantages as they estimate mutually independent comparisons in the data (see Dobson and Barnett 2011, sec. 6.2.5, p. 91 for a detailed explanation of orthogonality). Second, it is crucial that contrasts are defined in a way such that they answer the research questions. One way to accomplish this second point is to use the hypothesis matrix to generate contrasts (e.g., via the hypr package), as this ensures that one uses contrasts that exactly estimate the comparisons of interest in a given study. Third, in the Bayesian context, the scaling of contrasts is important, and has to be considered when defining priors: the scaling of the priors has to be tailored to the scaling of each individual contrasts. In situations where we define the same prior for all coefficients of a factor, the scaling of the contrasts has to be set accordingly.

8.4.1 Centered contrasts

Contrasts are often constrained to be centered, such that the individual contrast coefficients \(c_i\) for different factor levels \(i\) sum to \(0\): \(\sum_{i=1}^I c_i = 0\). This has advantages when estimating interactions with other factors or covariates (we discuss interactions between factors in the next chapter). All contrasts discussed here are centered except for the treatment contrast, in which the contrast coefficients for each contrast do not sum to zero:

colSums(contr.treatment(4))
## 2 3 4 
## 1 1 1

Other contrasts, such as repeated contrasts, are centered and the contrast coefficients for each contrast sum to \(0\):

colSums(contr.sdif(4))
## 2-1 3-2 4-3 
##   0   0   0

The contrast coefficients mentioned above appear in the contrast matrix. The weights in the hypothesis matrix are always centered. This is also true for the treatment contrast. The reason is that they code comparisons between conditions or bundles of conditions. The only exception are the weights for the intercept, which are all the same and together always sum to \(1\) in the hypothesis matrix. This is done to ensure that when applying the generalized matrix inverse, the intercept results in a constant term with values of \(1\) in the contrast matrix. An important question concerns whether (or when) the intercept needs to be considered in the generalized matrix inversion, and whether (or when) it can be ignored. This question is closely related to orthogonal contrasts, a concept we turn to below.

8.4.2 Orthogonal contrasts

Two centered contrasts \(c_1\) and \(c_2\) are orthogonal to each other if the following condition applies. Here, \(i\) is the \(i\)-th cell of the vector representing the contrast.

\[\begin{equation} \sum_{i=1}^I c_{1,i} \cdot c_{2,i} = 0 \end{equation}\]

Whether contrasts are orthogonal can often be determined easily by computing the correlation between two contrasts. Orthogonal contrasts have a correlation of \(0\).

For example, coding two factors in a \(2 \times 2\) design (we return to this case in a section on designs with two factors below) using sum contrasts, these sum contrasts and their interaction are orthogonal to each other:

(Xsum <- cbind(F1 = c(1, 1, -1, -1),
               F2 = c(1, -1, 1, -1),
               F1xF2 = c(1, -1, -1, 1)))
##      F1 F2 F1xF2
## [1,]  1  1     1
## [2,]  1 -1    -1
## [3,] -1  1    -1
## [4,] -1 -1     1
cor(Xsum)
##       F1 F2 F1xF2
## F1     1  0     0
## F2     0  1     0
## F1xF2  0  0     1

The correlations between the different contrasts (i.e., the off-diagonals) are exactly \(0\). Notice that sum contrasts coding one multi-level factor are not orthogonal to each other:

cor(contr.sum(4))
##      [,1] [,2] [,3]
## [1,]  1.0  0.5  0.5
## [2,]  0.5  1.0  0.5
## [3,]  0.5  0.5  1.0

Here, the correlations between individual contrasts, which appear in the off-diagonals, deviate from \(0\), indicating non-orthogonality. The same is also true for treatment and repeated contrasts:

cor(contr.sdif(4))
##       2-1   3-2   4-3
## 2-1 1.000 0.577 0.333
## 3-2 0.577 1.000 0.577
## 4-3 0.333 0.577 1.000
cor(contr.treatment(4))
##        2      3      4
## 2  1.000 -0.333 -0.333
## 3 -0.333  1.000 -0.333
## 4 -0.333 -0.333  1.000

Orthogonality of contrasts plays a critical role when computing the generalized inverse. In the inversion operation, orthogonal contrasts are converted independently from each other. That is, the presence or absence of another orthogonal contrast does not change the resulting weights. In fact, for orthogonal contrasts, applying the generalized matrix inverse to the hypothesis matrix simply furnishes a scaled version of the hypothesis matrix in the contrast matrix (for mathematical details see Schad et al. 2020).

In Bayesian models, scaling is always important, since we need to interpret the scale in order to define priors or interpret posteriors. Therefore, when working with contrasts in Bayesian models, the generalized matrix inverse is always a good procedure to use.

8.4.3 The role of the intercept in non-centered contrasts

A related question concerns whether the intercept needs to be considered when computing the generalized inverse for a contrast. This is of key importance when using the generalized matrix inverse to define contrasts: the resulting contrast matrix and also the definition of estimates can completely change between a situation where the intercept is explicitly considered or not considered, and can thus change the resulting estimates in possibly unintended ways. Thus, if the definition of the intercept is incorrect, the estimates of slopes may also be wrong.

More specifically, it turns out that considering the intercept is necessary for contrasts that are not centered. This is the case for treatment contrasts which are not centered; e.g., the treatment contrast for two factor levels c1vs0 = c(0,1): \(\sum_i c_i = 0 + 1 = 1\). One can actually show that the formula to determine whether contrasts are centered (i.e., \(\sum_i c_i = 0\)) is the same formula as the formula to test whether a contrast is “orthogonal to the intercept”. Remember that for the intercept, all contrast coefficients are equal to one: \(c_{1,i} = 1\) (here, \(c_{1,i}\) indicates the vector of contrast coefficients associated with the intercept). We enter these contrast coefficient values into the formula testing whether a contrast is orthogonal to the intercept (here, \(c_{2,i}\) indicates the vector of contrast coefficients associated with some contrast for which we want to test whether it is “orthogonal to the intercept”): \(\sum_i c_{1,i} \cdot c_{2,i} = \sum_i 1 \cdot c_{2,i} = \sum_i c_{2,i} = 0\). The resulting formula is: \(\sum_i c_{2,i} = 0\), which is exactly the formula for whether a contrast is centered. Because of this analogy, treatment contrasts can be viewed to be `not orthogonal to the intercept.’ This means that the intercept needs to be considered when computing the generalized inverse for treatment contrasts. As we have discussed above, when the intercept is included in the hypothesis matrix, the weights for this intercept term should sum to one, as this yields a column of ones for the intercept term in the contrast matrix.

We can see that considering the intercept makes a difference for the treatment contrast. First, we define the comparisons involved in a treatment contrast, where two experimental conditions b and c are each compared to a baseline condition a (b~a and c~a). In addition, we explicitly code the intercept term, which involves a comparison of the baseline to 0 (a~0). We take a look at the resulting contrast matrix:

hypr(int = a ~ 0, b1 = b ~ a, b2 = c ~ a)
## hypr object containing 3 null hypotheses:
## H0.int: 0 = a      (Intercept)
##  H0.b1: 0 = b - a
##  H0.b2: 0 = c - a
## 
## Call:
## hypr(int = ~a, b1 = ~b - a, b2 = ~c - a, levels = c("a", "b", 
## "c"))
## 
## Hypothesis matrix (transposed):
##   int b1 b2
## a  1  -1 -1
## b  0   1  0
## c  0   0  1
## 
## Contrast matrix:
##   int b1 b2
## a 1   0  0 
## b 1   1  0 
## c 1   0  1
contr.treatment(c("a", "b", "c"))
##   b c
## a 0 0
## b 1 0
## c 0 1

This shows a contrast matrix that we know from the treatment contrast. The intercept is coded as a column of ones. And each of the comparisons is coded as a \(1\) in the condition which is compared to the baseline, and a \(0\) in other conditions. The point is here that this gives us the contrast matrix that is expected and known for the treatment contrast.

We can also ignore the intercept in the specification of the comparisons:

hypr(b1 = m1 ~ m0, b2 = m2 ~ m0)
## hypr object containing 2 null hypotheses:
## H0.b1: 0 = m1 - m0
## H0.b2: 0 = m2 - m0
## 
## Call:
## hypr(b1 = ~m1 - m0, b2 = ~m2 - m0, levels = c("m0", "m1", "m2"
## ))
## 
## Hypothesis matrix (transposed):
##    b1 b2
## m0 -1 -1
## m1  1  0
## m2  0  1
## 
## Contrast matrix:
##    b1   b2  
## m0 -1/3 -1/3
## m1  2/3 -1/3
## m2 -1/3  2/3

Notice that the resulting contrast matrix now looks very different from the contrast matrix that we know from the treatment contrast. Indeed, this contrast also estimates a reasonable set of quantities. It again estimates whether the condition mean m1 differs from the baseline and whether m2 differs from baseline. However, the intercept now estimates the average dependent variable across all three conditions (i.e., the grand mean). This can be seen by explicitly adding a comparison of the average of all three conditions to \(0\):

hypr(int = (m0 + m1 + m2) / 3 ~ 0, b1 = m1 ~ m0, b2 = m2 ~ m0)
## hypr object containing 3 null hypotheses:
## H0.int: 0 = (m0 + m1 + m2)/3  (Intercept)
##  H0.b1: 0 = m1 - m0
##  H0.b2: 0 = m2 - m0
## 
## Call:
## hypr(int = ~1/3 * m0 + 1/3 * m1 + 1/3 * m2, b1 = ~m1 - m0, b2 = ~m2 - 
##     m0, levels = c("m0", "m1", "m2"))
## 
## Hypothesis matrix (transposed):
##    int b1  b2 
## m0 1/3  -1  -1
## m1 1/3   1   0
## m2 1/3   0   1
## 
## Contrast matrix:
##    int  b1   b2  
## m0    1 -1/3 -1/3
## m1    1  2/3 -1/3
## m2    1 -1/3  2/3

The last two columns of the resulting contrast matrix are now the same as when the intercept was ignored, which confirms that the two columns encode the same comparison.

8.5 Computing condition means from estimated contrasts

As mentioned earlier, one advantage of Bayesian modeling is that based on the posterior samples, it is possible to very flexibly compute new comparisons and estimates. Above (see section 8.1.4), we had discussed the case where the Bayesian model estimated the condition means instead of contrasts by removing the intercept from the brms model (the formula in brms was: DV ~ -1 + F). This allowed us to get posterior samples from each condition mean, and then to compute any possible comparison between condition means by subtracting the corresponding samples.

Importantly, posterior samples for the condition means can also be obtained after fitting a model with contrasts. We illustrate this here for the case of sum contrasts. Let’s use our above example of a design where we assess response times (in milliseconds, DV) for three different word classes adjectives, nouns, and verbs, that is, for a 3-level factor \(F\). In the above example, factor \(F\) was coded using a sum contrast, where the first contrast coded the difference of adjectives from the grand mean, and the second contrast coded the difference of nouns from the grand mean. This was the corresponding contrast matrix:

contrasts(df_contrasts2$F) <- contr.hypothesis(HcSum)
contrasts(df_contrasts2$F)
##            b1 b2
## adjectives  1  0
## nouns       0  1
## verbs      -1 -1

We had estimated a brms model for this data. The posterior estimates show results for the intercept (which is estimated to be \(450\) ms) and for our two coded comparisons. The effect FcH01 codes our first comparison that response times for adjectives differ from the grand mean, and show an estimate that response times for adjectives are about \(50\) ms shorter than the grand mean. Moreover, the effect FcH02 codes our second comparison that response times for nouns differ from the grand mean, and show the estimate that response times for nouns are \(50\) ms longer than the grand mean.

fixef(fit_Sum)
##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    450.2      6.74 436.7 463.7
## FcH01        -49.6      9.82 -69.4 -29.4
## FcH02         49.6      9.95  29.8  69.4

Of course other comparisons might be of interest to us as well. For example, we might be interested in estimating how strongly response times for verbs differ from the grand mean.

To do so, one possible first step is to obtain the posteriors for the response times in each of the three conditions. How can this be done? The first step is to again extract the posterior samples from the model:

df_postSamp_Sum <- as_draws_df(fit_Sum)

We can see the samples for our first contrast (b_FcH01) and for our second contrast (b_FcH02). How can we now compute the posterior samples for each of the condition means, i.e., for adjectives, nouns, and verbs? For this, we need to take another look at the contrast matrix.

contrasts(df_contrasts2$F)
##            b1 b2
## adjectives  1  0
## nouns       0  1
## verbs      -1 -1

It tells us how the condition means are computed. For adjectives (see the first row of the contrast matrix), we can see that the response time is computed by taking \(1\) times the coefficient for b1 (i.e., FcH01) and \(0\) times the coefficient for b2 (i.e., FcH02). Thus, response times for adjectives are simply the samples for the b1 (i.e., FcH01) contrast. The contrast matrix does not show the intercept term, which is implicitly added. Thus, we also have to add the estimates for the intercept. Thus, the condition mean for adjectives is computed as follows:

df_postSamp_Sum <- df_postSamp_Sum %>%
  mutate(b_adjectives = b_Intercept + b_FcH01)

Similarly, we can obtain the posterior samples for the response times for nouns. The computation can be seen from the second row of the contrast matrix, which shows that the contrast b1 (i.e., FcH01) has weight \(0\) times, whereas the contrast b2 (i.e., FcH02) has weight \(1\). Adding the intercept thus gives:

df_postSamp_Sum <- df_postSamp_Sum %>%
  mutate(b_nouns = b_Intercept + b_FcH02)

Finally, we want to obtain posterior samples for the average response times for verbs. For verbs, the third row of the contrast matrix shows two times a \(-1\). Thus, contrasts b1 (i.e., FcH01) and b2 (i.e., FcH02) have to be subtracted from the intercept:

df_postSamp_Sum$b_verbs <-
  df_postSamp_Sum$b_Intercept - df_postSamp_Sum$b_FcH01 -
  df_postSamp_Sum$b_FcH02

This yields posterior samples for the mean response times for verbs.

We can now look at the posterior means and 95% credible intervals for adjectives, nouns, and verbs by computing the means and quantiles across all computed samples.

postTab <- df_postSamp_Sum %>%
  # remove the meta-data:
  as.data.frame() %>%
  select(b_adjectives, b_nouns, b_verbs) %>%
  # transform from wide to long with tidyr:
  pivot_longer(cols = everything(),
               names_to = "condition",
               values_to = "samp") %>%
  group_by(condition) %>%
  summarize(post_mean = mean(samp),
            `2.5%` = quantile(samp, p = 0.025),
            `97.5%` = quantile(samp, p = 0.975))
postTab
## # A tibble: 3 × 4
##   condition    post_mean `2.5%` `97.5%`
##   <chr>            <dbl>  <dbl>   <dbl>
## 1 b_adjectives      401.   377.    424.
## 2 b_nouns           500.   476.    524.
## 3 b_verbs           450.   426.    475.

The results show that as expected the posterior mean for adjectives is \(400\) ms, for nouns it is \(500\) ms, and for verbs, the posterior mean is \(450\) ms. Moreover, we have now posterior credible intervals for each of these estimates.

In fact, brms has a very convenient built-in function that allows us to compute these nested effects automatically (robust = FALSE shows the posterior mean; by default brms shows the posterior median). Notice that you need to add a [] after the function call, otherwise brms will plot the results.

conditional_effects(fit_Sum, robust = FALSE)[]
## $F
##            F  DV cond__  effect1__ estimate__ se__ lower__ upper__
## 1 adjectives 450      1 adjectives        401 11.8     377     424
## 2      nouns 450      1      nouns        500 12.1     476     524
## 3      verbs 450      1      verbs        450 12.1     426     475

The same function allows us to visualize the effects, as shown in Figure 8.5.

conditional_effects(fit_Sum, robust = FALSE)
Estimated condition means, computed from a `brms` model fitted with a sum contrast.

FIGURE 8.5: Estimated condition means, computed from a brms model fitted with a sum contrast.

Coming back to our hand-crafted computations, the posterior samples can be used to compute additional comparisons. For example, we might be interested in how much response times for verbs differ from the grand mean. This can be computed based on the samples for the condition means: we first compute the grand mean from the three condition means, and then we compare this to the estimate for verbs.

df_postSamp_Sum <- df_postSamp_Sum %>%
  mutate(GM = (b_adjectives + b_nouns + b_verbs) / 3,
         b_FcH03 = b_verbs - GM)
c(post_mean = mean(df_postSamp_Sum$b_FcH03),
  quantile(df_postSamp_Sum$b_FcH03, p = c(0.025, 0.975)))
## post_mean      2.5%     97.5% 
##   -0.0111  -20.8347   19.9165

The results show that reading times for verbs are quite the same as the grand mean, with a posterior mean estimate for the differences of nearly \(0\) ms, and with a 95% credible interval ranging between \(-20\) and \(+20\) ms.

The key message here is that based on the contrast matrix, it is possible to compute posterior samples for the condition means, and then to compute any arbitrary further comparisons or contrasts. We want to stress again that just obtaining the posterior distribution of a comparison does not allow us to argue that we have evidence for the effect; to argue that we have evidence for an effect being present/absent, we need Bayes factors. But the approach we outline above does allow us to obtain posterior means and credible intervals for arbitrary comparisons. However, different combinations of contrasts and priors may yield different results, especially when they are used to compute Bayes factors, due to their sensitivity for the priors.

We briefly show how to compute posterior samples for condition means for one more example contrast, namely for repeated contrasts. Here, the contrast matrix is:

contrasts(df_contrasts3$F) <- contr.hypothesis(HcRep)
contrasts(df_contrasts3$F)
##    c2vs1 c3vs2 c4vs3
## F1 -3/4  -1/2  -1/4 
## F2  1/4  -1/2  -1/4 
## F3  1/4   1/2  -1/4 
## F4  1/4   1/2   3/4

The model estimates were:

fixef(fit_Rep)
##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.99      2.58  15.00 25.10
## Fc2vs1        9.57      6.94  -4.13 23.19
## Fc3vs2       -9.20      6.94 -22.95  4.23
## Fc4vs3       29.09      6.98  15.22 43.09

We first obtain the posterior samples for the contrasts:

df_postSamp_Rep <- as.data.frame(fit_Rep)

Then we compute the posterior samples for condition \(F1\). First, we have to add the intercept. Then, we can see in the contrast matrix that to compute the condition mean for \(F1\), we have to add up all contrasts, using the weights c(-3/4, -1/2, -1/4) for each of the three contrasts (see first row of the contrast matrix). Thus, the posterior samples are computed as follows:

df_postSamp_Rep <- df_postSamp_Rep %>%
  mutate(b_F1 = b_Intercept +
           -3 / 4 * b_Fc2vs1 +
           -1 / 2 * b_Fc3vs2 +
           -1 / 4 * b_Fc4vs3)

The other condition means are computed correspondingly:

df_postSamp_Rep <- df_postSamp_Rep %>%
  mutate(b_F2 = b_Intercept +
           1 / 4 * b_Fc2vs1 +
           -1 / 2 * b_Fc3vs2 +
           -1 / 4 * b_Fc4vs3,
         b_F3 = b_Intercept +
           1 / 4 * b_Fc2vs1 +
           1 / 2 * b_Fc3vs2 +
           -1 / 4 *b_Fc4vs3,
         b_F4 = b_Intercept +
           1 / 4 * b_Fc2vs1 +
           1 / 2 * b_Fc3vs2 +
           3 / 4 * b_Fc4vs3)

Now we can look at the posterior means and credible intervals:

postTab <- df_postSamp_Rep %>%
  select(b_F1, b_F2, b_F3, b_F4) %>%
  pivot_longer(cols = everything(),
               names_to = "condition",
               values_to = "samp") %>%
  group_by(condition) %>%
  summarize(post_mean = round(mean(samp)),
            `2.5%` = round(quantile(samp, p = 0.025)),
            `97.5%` = round(quantile(samp, p = 0.975)))
print(postTab, n = 4)
## # A tibble: 4 × 4
##   condition post_mean `2.5%` `97.5%`
##   <chr>         <dbl>  <dbl>   <dbl>
## 1 b_F1             10      0      20
## 2 b_F2             20     10      30
## 3 b_F3             11      0      21
## 4 b_F4             40     30      49

We can verify that brms function return the same values:

conditional_effects(fit_Rep, robust = FALSE)[]
## $F
##    F DV cond__ effect1__ estimate__ se__ lower__ upper__
## 1 F1 20      1        F1       10.1 4.96   0.174    19.8
## 2 F2 20      1        F2       19.7 5.00   9.845    29.7
## 3 F3 20      1        F3       10.5 5.07   0.190    20.6
## 4 F4 20      1        F4       39.6 4.93  29.562    49.3

The posterior means reflect exactly the means in the data (for comparison see Figure 8.2 and Table 8.3). We now have posterior samples for each of the conditions and can compute posterior credible intervals as well as new comparisons between conditions.

8.6 Summary

Contrasts in Bayesian models work in exactly the same way as in frequentist models. Contrasts provide a way to tell the model how to code factors into numeric covariates. That is, they provide a way to define which comparisons between which condition means or bundles of condition means should be estimated in the Bayesian model. There are a number of default contrasts, like treatment contrasts, sum contrasts, repeated contrasts, or Helmert contrasts, that estimate specific comparisons between condition means. A much more powerful procedure is to use the generalized matrix inverse, e.g., as implemented in the hypr package, to derive contrasts automatically after specifying the comparisons that a contrast should estimate. We have seen that in Bayesian models, it is quite straightforward to compute posterior samples for new contrasts post-hoc, after the model is fit. However, specifying precise contrasts is still of key importance when doing model comparisons (via Bayes factors) to answer the question of whether the data provide evidence for an effect of interest. If the effect of interest relates to a factor, then it has to be defined using contrast coding.

8.7 Further reading

A good discussion on contrast coding appears in Chapter 15 of Baguley (2012). A book-length treatment is by Rosenthal, Rosnow, and Rubin (2000). A brief discussion on contrast coding appears in Venables and Ripley (2002).

8.8 Exercises

Exercise 8.1 Contrast coding for a four-condition design

Load the following data. These data are from Experiment 1 in a set of reading studies on Persian (Safavi, Husain, and Vasishth 2016). This is a self-paced reading study on particle-verb constructions, with a \(2\times 2\) design: distance (short, long) and predictability (predictable, unpredictable). The data are from a critical region in the sentence. All the data from the Safavi, Husain, and Vasishth (2016) paper are available from https://github.com/vasishth/SafaviEtAl2016.

library(bcogsci)
data("df_persianE1")
dat1 <- df_persianE1
head(dat1)
##     subj item   rt distance   predability
## 60     4    6  568    short   predictable
## 94     4   17  517     long unpredictable
## 146    4   22  675    short   predictable
## 185    4    5  575     long unpredictable
## 215    4    3  581     long   predictable
## 285    4    7 1171     long   predictable

The four conditions are:

  • Distance=short and Predictability=unpredictable
  • Distance=short and Predictability=predictable
  • Distance=long and Predictability=unpredictable
  • Distance=long and Predictability=predictable

The researcher wants to do the following sets of comparisons between condition means:

Compare the condition labeled Distance=short and Predictability=unpredictable with each of the following conditions:

  • Distance=short and Predictability=predictable
  • Distance=long and Predictability=unpredictable
  • Distance=long and Predictability=predictable

Questions:

  • Which contrast coding is needed for such a comparison?
  • First, define the relevant contrast coding. Hint: You can do it by creating a condition column labeled a,b,c,d and then use a built-in contrast coding function.
  • Then, use the hypr library function to confirm that your contrast coding actually does the comparison you need.
  • Fit a simple linear model with the above contrast coding and display the slopes, which constitute the relevant comparisons.
  • Now, compute each of the four conditions’ means and check that the slopes from the linear model correspond to the relevant differences between means that you obtained from the data.

Exercise 8.2 Helmert coding for a six-condition design.

This data-set is from a psycholinguistics study, and although we explain the theoretical background below, one does not need to deeply understand the research questions to be able to define the contrasts.

Load the following data:

library(bcogsci)
data("df_polarity")
head(df_polarity)
##   subject item condition times value
## 1       1    6         f   SFD   328
## 2       1   24         f   SFD   206
## 3       1   35         e   SFD   315
## 4       1   17         e   SFD   265
## 5       1   34         d   SFD   252
## 6       1    7         a   SFD   156

The data come from an eyetracking study in German reported in Vasishth et al. (2008). The experiment is a reading study involving six conditions. The sentences are in English, but the original design was involved German sentences. In German, the word durchaus (certainly) is a positive polarity item: in the constructions used in this experiment, durchaus cannot have a c-commanding element that is a negative polarity item licensor. By contrast, the German negative polarity item jemals (ever) is a negative polarity item: in the constructions used in this experiment, jemals must have a c-commanding element that is a negative polarity item licensor.

Here are the conditions:

  • Negative polarity items
    1. Grammatical: No man who had a beard was ever thrifty.
    2. Ungrammatical (Intrusive NPI licensor): A man who had no beard was ever thrifty.
    3. Ungrammatical: A man who had a beard was ever thrifty.
  • Positive polarity items
    1. Ungrammatical: No man who had a beard was certainly thrifty.
    2. Grammatical (Intrusive NPI licensor): A man who had no beard was certainly thrifty.
    3. Grammatical: A man who had a beard was certainly thrifty.

We will focus only on re-reading time in this data set. Subset the data so that we only have re-reading times in the data frame:

dat2 <- subset(df_polarity, times == "RRT")
head(dat2)
##      subject item condition times value
## 6365       1   20         b   RRT   240
## 6366       1    3         c   RRT  1866
## 6367       1   13         a   RRT   530
## 6368       1   19         a   RRT   269
## 6369       1   27         c   RRT   845
## 6370       1   26         b   RRT   635

The comparisons we are interested in are as follows:

  • What is the difference in reading time between negative polarity items and positive polarity items? In other words, we want to compare the mean of conditions (a), (b), (c) with the mean of (d), (e), (f).
  • Within negative polarity items, what is the difference between grammatical and ungrammatical conditions? In other words, we want to compare condition (a) with the average of (b) and (c).
  • Within negative polarity items, what is the difference between the two ungrammatical conditions? Here, we want to compare conditions (b) and (c).
  • Within positive polarity items, what is the difference between grammatical and ungrammatical conditions? Here, we want to compare condition (d) with the average of (e) and (f).
  • Within positive polarity items, what is the difference between the two grammatical conditions? Here, the comparison is between (e) and (f).

Use the hypr package to specify the comparisons specified above, and then extract the contrast matrix. Finally, specify the contrasts to the condition column in the data frame. Fit a linear model using this contrast specification, and then check that the estimates from the model match the mean differences between the conditions being compared.

Exercise 8.3 Number of possible comparisons in a single model.

  • How many comparisons can one make in a single model when there is a single factor with four levels? Why can we not code four comparisons in a single model?
  • How many comparisons can one code in a model where there are two factors, one with three levels and one with two levels?
  • How about a model for a \(2 \times 2 \times 3\) design?

References

Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Macmillan International Higher Education.

Bolker, Ben. 2018. “Contrasts.” https://computationalcognitivescience.github.io/lovelace/.

Bürkner, Paul-Christian, and Emmanuel Charpentier. 2020. “Modelling Monotonic Effects of Ordinal Predictors in Bayesian Regression Models.” British Journal of Mathematical and Statistical Psychology. https://doi.org/https://doi.org/10.1111/bmsp.12195.

Dablander, Fabian, Karoline Huth, Quentin F. Gronau, Alexander Etz, and Eric-Jan Wagenmakers. 2022. “A Puzzle of Proportions: Two Popular Bayesian Tests Can Yield Dramatically Different Conclusions.” Statistics in Medicine 41 (8): 1319–33.

Dobson, Annette J., and Adrian Barnett. 2011. An Introduction to Generalized Linear Models. CRC Press.

Friendly, Michael, John Fox, and Phil Chalmers. 2020. Matlib: Matrix Functions for Teaching and Learning Linear Algebra and Multivariate Statistics. https://CRAN.R-project.org/package=matlib.

Heister, Julian, Kay-Michael Würzner, and Reinhold Kliegl. 2012. “Analysing Large Datasets of Eye Movements During Reading.” Visual Word Recognition 2: 102–30.

Rabe, Maximilian M., Shravan Vasishth, Sven Hohenstein, Reinhold Kliegl, and Daniel J. Schad. 2020a. “hypr: An R Package for Hypothesis-Driven Contrast Coding.” Journal of Open Source Software 5 (48): 2134.

Ripley, Brian D. 2023. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. http://www.stats.ox.ac.uk/pub/MASS4/.

Rosenthal, Robert, Ralph L. Rosnow, and Donald B. Rubin. 2000. Contrasts and Effect Sizes in Behavioral Research: A Correlational Approach. Cambridge University Press.

Rouder, Jeffrey N, Richard D Morey, Paul L Speckman, and Jordan M Province. 2012. “Default Bayes Factors for Anova Designs.” Journal of Mathematical Psychology 56 (5): 356–74.

Safavi, Molood Sadat, Samar Husain, and Shravan Vasishth. 2016. “Dependency Resolution Difficulty Increases with Distance in Persian Separable Complex Predicates: Implications for Expectation and Memory-Based Accounts.” Frontiers in Psychology 7 (403). https://doi.org/10.3389/fpsyg.2016.00403.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and Language 110. https://doi.org/10.1016/j.jml.2019.104038.

2020. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and Language 110 (February): 104038. https://doi.org/10/gf9tjp.

Vasishth, Shravan, Sven Bruessow, Richard L. Lewis, and Heiner Drenhaus. 2008. “Processing Polarity: How the Ungrammatical Intrudes on the Grammatical.” Cognitive Science 32 (4, 4): 685–712. https://doi.org/https://doi.org/10.1080/03640210802066865.

Vasishth, Shravan, Daniel J. Schad, Audrey Bürki, and Reinhold Kliegl. 2021. “Linear Mixed Models for Linguistics and Psychology: A Comprehensive Introduction.” https://vasishth.github.io/Freq_CogSci/.

Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S-PLUS. New York: Springer.


  1. When using treatment coding, it’s important to be aware that the intercept won’t be centered. As mentioned in Box 4.2, this means the prior one sets for the intercept in brms will actually be for assigned to its centered version. Although this distinction may not have any impact when one uses wide priors, it’s still worth keeping it in mind.↩︎

  2. At this point, there is no need to understand in detail what this means. We refer the interested reader to Schad et al. (2020). For a quick overview, we recommend a vignette explaining the generalized inverse in the matlib package (https://cran.r-project.org/web/packages/matlib/vignettes/ginv.html; Friendly, Fox, and Chalmers 2020).↩︎

  3. If one is trying to set a general prior for differences between means, then the function bayestestR::contr.equalprior() can be used, which sets equal marginal priors on all means and differences between them.↩︎