Categories: stats / Tags: stan Bayesian r /
Update Feb 1st, 2024: I updated the Stan syntax.
The initial problem
I wrote what I thought it was the generative process for some modeling work, and it looked too common to not have a name, so I started to ask around in Twitter:
Prob q: Is there a name for a categorical distribution that whenever you get an outcome you remove it from the set?
— @bruno_nicenboim@fediscience.org (@bruno_nicenboim) March 9, 2021
Ex. 1st draw: y ~ cat(p1,p2,p3,p4) ; y = 2
2nd draw y ~ cat(p1/k,p3/k,p4/k) (k = p1+p3+p4); y =1
3rd, y ~ categorical(p3/(p3+p4),p4/(p3+p4)); y =3
4th, it's y =4
One useful clue was this one:
It turns out that the distribution that I was describing is in general used for rankings or ordered data and it is called an exploded logit distribution.1
In this post, I’ll show how this model can be fit in the probabilistic programming language Stan, and how it can be used to describe the underlying order of ranking data.
I’m going to load some R packages that will be useful throughout this post.
library(dplyr) # Data manipulation
library(purrr) # List manipulation
library(ggplot2) # Nice plots
library(extraDistr) # More distributions
library(rcorpora) # Get random words
library(cmdstanr) # Lightweight Stan interface
library(bayesplot) # Nice Bayesian plots
Ranking data
Ranking data appear when we care about the underlying order that certain elements have. We might want to know which are the best horses after looking at several races (Gakis et al. 2018), which is the best candidate for a job after a series of interviewers talked to several candidates. More in line with cognitive science, we might want to know which are the best possible completions for a sentence or the best exemplars of a category.
One way to get a ranking of exemplars of a category, for example, is to present them to participants and ask them to order all (or a subset) of them (see Barsalou 1985).
A ranking simulation using pizza toppings
Let’s consider the following 25 pizza toppings:
toppings <- corpora("foods/pizzaToppings")$pizzaToppings
N_toppings <- length(toppings)
toppings
[1] "anchovies" "artichoke"
[3] "bacon" "breakfast bacon"
[5] "Canadian bacon" "cheese"
[7] "chicken" "chili peppers"
[9] "feta" "garlic"
[11] "green peppers" "grilled onions"
[13] "ground beef" "ham"
[15] "hot sauce" "meatballs"
[17] "mushrooms" "olives"
[19] "onions" "pepperoni"
[21] "pineapple" "sausage"
[23] "spinach" "sun-dried tomato"
[25] "tomatoes"
Let’s say that we want to know the underlying order of pizza toppings. For the modeling, I’m going to assume that the toppings are ordered according to an underlying value, which also represents how likely it is for each topping to be the exemplar of their category.
To get a known ground truth for the ranking, I’m going to simulate an order of pizza toppings. I assign probabilities that sum up to one to the toppings by drawing a random sample from a Dirichlet distribution. The Dirichlet distribution is the generalization of the Beta distribution. It has a concentration parameter, usually \(\boldsymbol{\alpha}\), which is a vector as long as the probabilities we are sampling (25 here). When the vector is full of ones, the distribution is uniform: All probabilities are equally likely, so on average each one is \(\frac{1}{vector \text{ } length}\) (\(\frac{1}{25}\) here). By setting all the concentration parameters below one (namely \(.2\)), I’m enforcing sparsity in the random values that I’m generating, that is, many probability values close to zero.
These is the true order that I’m assuming here:
# all elements of the vector are .5
alpha <- rep(.2, N_toppings)
# Generate one draw from a Dirichlet distribution
P_toppings <- c(rdirichlet(1, alpha)) %>%
# Add names
setNames(toppings) %>%
# Sort from the best exemplar
sort(decreasing = TRUE)
P_toppings %>%
round(3)
artichoke breakfast bacon onions
0.387 0.140 0.124
ground beef hot sauce spinach
0.091 0.090 0.072
sun-dried tomato anchovies feta
0.036 0.016 0.010
green peppers bacon Canadian bacon
0.006 0.006 0.006
chili peppers pepperoni garlic
0.005 0.005 0.003
tomatoes ham mushrooms
0.002 0.001 0.001
meatballs sausage olives
0.000 0.000 0.000
grilled onions chicken cheese
0.000 0.000 0.000
pineapple
0.000
- Given these values, if I were to ask a participant “What’s the most appropriate topping for a pizza?” I would assume that 38.74 percent of the time, I would get artichoke.
Essentially, we expect something like this to be happening.
\[\begin{equation} response \sim Categorical(\Theta_{toppings}) \tag{1} \end{equation}\]
With \(\Theta_{toppings}\) representing the different probabilities for each topping. The probability mass function of the categorical distribution is absurdly simple: It’s just the probability of the outcome.
\[\begin{equation} p(x = i) = \Theta_i \tag{2} \end{equation}\]
where \(i = \{\)artichoke, breakfast bacon, onions, ground beef, hot sauce, spinach, sun-dried tomato, anchovies, feta, green peppers, bacon, Canadian bacon, chili peppers, pepperoni, garlic, tomatoes, ham, mushrooms, meatballs, sausage, olives, grilled onions, chicken, cheese, pineapple\(\}\).
We can simulate this with 100 participants as follows:
response <- rcat(100, P_toppings, names(P_toppings))
And this should match approximately P_toppings
.
table(response)/100
response
artichoke breakfast bacon onions
0.31 0.12 0.17
ground beef hot sauce spinach
0.11 0.14 0.04
sun-dried tomato anchovies feta
0.07 0.02 0.02
green peppers bacon Canadian bacon
0.00 0.00 0.00
chili peppers pepperoni garlic
0.00 0.00 0.00
tomatoes ham mushrooms
0.00 0.00 0.00
meatballs sausage olives
0.00 0.00 0.00
grilled onions chicken cheese
0.00 0.00 0.00
pineapple
0.00
It seems that by only asking participants to give the best topping we could already deduce the underlying order…
True, but one motivation for considering ranking data is the amount of information that we gather with a list due to their combinatorial nature. If we ask participants to rank \(n\) items, an answer consists in making a single selection out of \(n!\) possibilities. Ordering 7 pizza toppings, for example, constitutes making a single selection out of 5040 possibilities!
If we don’t relay on lists and there is sparcity, it requires a large number of participants until we get answers of low probability. (For example, we’ll need a very large number of participants until we hear something else but hammer as an exemplar of tools).
- Now, what happens if we ask about the second most appropriate topping for a pizza?
Now we need to exclude the first topping that was given, and draw another sample from a categorical distribution. (We don’t allow the participant to repeat toppings, that is, to say that the best topping is pineapple and the second best is also pineapple). This means that now the probability of the topping already given is zero, and that we need to normalize our original probability values by dividing them by the new total probability (which will be lower than 1).
Here, the probability of getting the element \(j\) (where \(j \neq i\)) is
\[\begin{equation} p(x = j) = \frac{\Theta_j}{\sum \Theta_{[-i]}} \tag{3} \end{equation}\]
where \(\Theta_{[-i]}\) represents the probabilities of all the outcomes except of \(i\), which was the first one.
- We can go on with the third best topping, where we need to normalize the remaining probabilities by dividing by the new sum of probabilities.
\[\begin{equation} p(x = k) = \frac{\Theta_k}{\sum \Theta_{[-i,-j]}} \tag{4} \end{equation}\]
- We can do this until we get to the last element, which will be drawn with probability 1.
And this is the exploded logit distribution.
This process can be simulated in R as follows:
rexploded <- function(n, ranked = 3, prob, labels = NULL){
#run n times
lapply(1:n, function(nn){
res <- rep(NA,ranked)
if(!is.null(labels)){
res <- factor(res, labels)
} else {
# if there are no labels, just 1,2,3,...
labels <- seq_along(prob)
}
for(i in 1:ranked){
# normalize the probability so that it sums to 1
prob <- prob/sum(prob)
res[i] <- rcat(1, prob = prob, labels = labels)
# remove the choice from the set:
prob[res[i]] <- 0
}
res
})
}
If we would like to simulate 50 subjects creating a ranking of the best 7 toppings, we would do the following:
res <- rexploded(n = 50,
ranked = 7,
prob = P_toppings,
labels = names(P_toppings))
# subject 1:
res[[1]]
[1] hot sauce artichoke breakfast bacon
[4] onions ground beef sun-dried tomato
[7] spinach
25 Levels: artichoke breakfast bacon onions ... pineapple
We have simulated ranking data of pizza toppings, can we recover the original probability values and “discover” the underlying order?
Fitting the exploded logistic distribution in Stan
To fit the model in Stan, I’m going to create a custom probability mass function that takes an array of integers, x
, which represents a set of rankings, and a vector of probability values, theta
, that sums up to one.
The logic of this function is that the probability mass function of a ranking \(\{i,j,k, \ldots, N \}\) can be written as a product of normalized categorical distributions (where the first one is just divided by 1).
\[\begin{equation} p(x = \{i,j,k,\ldots\}) = \frac{\Theta_i}{\sum \Theta} \cdot \frac{\Theta_j}{\sum \Theta_{[-i]}} \cdot \frac{\Theta_k}{\sum \Theta_{[-i, -j]}} \ldots \tag{5} \end{equation}\]
For Stan, we need the log-PDF. In log-space, products become sums, and divisions differences, and the log of \(\sum \Theta\) will be zero:
\[\begin{equation} \begin{aligned} log(p(x = \{i,j,k,\ldots\})) =& \log(\Theta_i) - log(\sum \Theta) \\ & + \log(\Theta_j) - \log(\sum \Theta_{[-i]}) \\ &+ \log(\Theta_k) -\log(\sum \Theta_{[-i, -j]}) \\ & + \ldots \end{aligned} \tag{5} \end{equation}\]
The following Stan custom function follows this logic but iterating over the rankings. In each iteration, it aggregates in the variable out
the addends of the log probability mass function, and turns the probability of selecting again the already ranked element to zero.
real exploded_lpmf(array[] int x, vector Theta){
real out = 0;
vector[num_elements(Theta)] thetar = Theta;
for(pos in x){
out += log(thetar[pos]) - log(sum(thetar));
thetar[pos] = 0;
}
return(out);
}
The whole model named exploded.stan
includes the usual data declaration, the parameter Theta
declared as a simplex (i.e., it sums to one), and a uniform Dirichlet prior for Theta
. (I’m assuming that I don’t know how sparse the probabilities are).
functions {
real exploded_lpmf(array[] int x, vector Theta){
real out = 0;
vector[num_elements(Theta)] thetar = Theta;
for(pos in x){
out += log(thetar[pos]) - log(sum(thetar));
thetar[pos] = 0;
}
return(out);
}
}
data{
int N_ranking; //total times the choices were ranked
int N_ranked; //total choices ranked
int N_options; //total options
array[N_ranking, N_ranked] int res;
}
parameters {
simplex[N_options] Theta;
}
model {
target += dirichlet_lpdf(Theta| rep_vector(1, N_options));
for(r in 1:N_ranking){
target += exploded_lpmf(res[r]|Theta);
}
}
Let’s see if I can recover the parameter values.
# Makethe list of lists into a matrix
res_matrix <- t(sapply(res, as.numeric))
ldata <- list(res = res_matrix,
N_ranked = length(res[[1]]),
N_options = length(P_toppings),
N_ranking = length(res))
m_expl <- cmdstan_model("./exploded.stan")
f_exploded <- m_expl$sample(
data = ldata,
seed = 123,
parallel_chains = 4
)
f_exploded
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -688.83 -688.43 3.67 3.67 -695.50 -683.60 1.00 1381 2171
Theta[1] 0.35 0.35 0.04 0.04 0.28 0.42 1.00 5554 2867
Theta[2] 0.16 0.16 0.02 0.02 0.13 0.20 1.00 5578 2692
Theta[3] 0.13 0.13 0.02 0.02 0.10 0.17 1.00 5535 2966
Theta[4] 0.08 0.08 0.01 0.01 0.06 0.11 1.00 5656 3210
Theta[5] 0.10 0.10 0.01 0.01 0.08 0.12 1.00 6052 2845
Theta[6] 0.06 0.06 0.01 0.01 0.04 0.08 1.00 5893 3232
Theta[7] 0.04 0.04 0.01 0.01 0.03 0.05 1.00 5861 3159
Theta[8] 0.02 0.02 0.00 0.00 0.01 0.03 1.00 6491 3193
Theta[9] 0.01 0.01 0.00 0.00 0.00 0.01 1.00 5362 2888
# showing 10 of 26 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
I plot the posterior distributions of the probability values and the true probability values below.
mcmc_recover_hist(f_exploded$draws("Theta"),
P_toppings,
facet_args =
list(scales = "fixed", ncol = 3)) +
theme(legend.position="bottom")
It looks reasonable. However, if we really want to be sure that this is working, we should probably use simulation based calibration (Talts et al. 2018).
What is this good for?
This super simple example shows how to get an underlying ranking based on a set of responses from a number of subjects. It’s straightforward to adapt this model to data from participants ranking elements from different sets of the same size (e.g., 7 out of 25 toppings, 7 out of 25 tools). It’s a little less straightforward if the sets are of different sizes, e.g., rank 7 toppings out 25, but 7 tools out 50. This is just because Stan doesn’t allow ragged arrays. See here some tips for implementing the latter model.
Could this be used as a cognitive model of people’s rankings?
Maybe. And I enter here in the realm of half baked research, ideal for a blog post.
Lee, Steyvers, and Miller (2014) show the implementation of a cognitive model for rank order data from the latent knowledge of participants, which is based on Thurstonian models (Thurstone 1927, 1931) fitted with Bayesian methods in JAGS (Johnson and Kuhn 2013).
The exploded logit model seems to be closely related to the Thurstonian model. The Thurstonian model assumes that each participant assigns an underlying score to each item of a set, which is drawn from a true score with normally distributed error. The score determines the order that the participant gives. We can think about the exploded logit similarly. While I modeled the underlying ranking based on probability values, one could assume that each participant \(s\) had their own score \(mu_{is}\) for each item (or pizza topping) \(i\), which is built as a common score \(mu_i\) together with some individual deviation \(\epsilon_{is}\):
\[\begin{equation} \mu_{is} = \mu_i + \epsilon_{is} \tag{6} \end{equation}\]
If we assume that \(\epsilon_{is}\) has a Gumbel distribution, then the probability of \(\mu_{is}\) being ranked first out of N options is determined by a softmax function:
\[\begin{equation} P(i) = \frac{\exp(\mu_i)}{\sum \exp(\mu)} \tag{7} \end{equation}\]
where \(\mu\) is the vector of scores for all elements of the set.
And the probability of ordering \(j\) second is:
\[\begin{equation} P(i,j,\ldots) = \frac{\exp(\mu_j)}{\sum \exp(\mu_{[-i]} )} \tag{8} \end{equation}\]
and so forth.
These last equations are essentially the same categorical distributions that I used before in (2) and (3), but the softmax function converts the unbounded scores into probabilities first. However, with the exploded logit, the error term goes away leading to a more tractable model. This is not the case for the Thurstonian model. The Thurstonian model is more complex, but at the same time we gain more flexibility. With the error term, the Thurstonian model can incorporate the reliability of the participants’ judgments and even correlations, which, as far as I know, can’t be included in the exploded logit model.
Session info 2
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=nl_NL.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=nl_NL.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bayesplot_1.10.0 cmdstanr_0.7.1 rcorpora_2.0.0 extraDistr_1.9.1 ggplot2_3.4.4.9000 purrr_1.0.2 dplyr_1.1.3
loaded via a namespace (and not attached):
[1] gtable_0.3.4 tensorA_0.36.2.1 xfun_0.41 bslib_0.5.1 processx_3.8.3 papaja_0.1.2 vctrs_0.6.5
[8] tools_4.3.2 ps_1.7.6 generics_0.1.3 tibble_3.2.1 fansi_1.0.6 highr_0.10 RefManageR_1.4.0
[15] pkgconfig_2.0.3 data.table_1.14.10 checkmate_2.3.1 distributional_0.3.2 lifecycle_1.0.4 compiler_4.3.2 farver_2.1.1
[22] stringr_1.5.0 munsell_0.5.0 tinylabels_0.2.4 httpuv_1.6.12 htmltools_0.5.7 sass_0.4.7 yaml_2.3.7
[29] later_1.3.1 pillar_1.9.0 jquerylib_0.1.4 ellipsis_0.3.2 cachem_1.0.8 abind_1.4-5 mime_0.12
[36] posterior_1.5.0 tidyselect_1.2.0 digest_0.6.34 stringi_1.8.1 reshape2_1.4.4 bookdown_0.36 labeling_0.4.3
[43] bibtex_0.5.1 fastmap_1.1.1 grid_4.3.2 colorspace_2.1-0 cli_3.6.2 magrittr_2.0.3 utf8_1.2.4
[50] withr_3.0.0 promises_1.2.1 scales_1.3.0 backports_1.4.1 lubridate_1.9.3 timechange_0.2.0 rmarkdown_2.25
[57] httr_1.4.7 matrixStats_1.2.0 blogdown_1.18 shiny_1.7.5.1 evaluate_0.23 knitr_1.45 rlang_1.1.3
[64] Rcpp_1.0.12 xtable_1.8-4 glue_1.7.0 xml2_1.3.5 rstudioapi_0.15.0 jsonlite_1.8.8 R6_2.5.1
[71] plyr_1.8.9
References
This model is also called the rank ordered logit model (Beggs, Cardell, and Hausman 1981) or Plackett–Luce model due to Plackett (1975) and Luce (1959), but I liked the explosion part more.↩︎
I used R (Version 4.3.2; R Core Team 2020) and the R-packages bayesplot (Version 1.10.0; Gabry et al. 2019), cmdstanr (Version 0.7.1; Gabry and Češnovar 2020), dplyr (Version 1.1.3; Wickham et al. 2021), extraDistr (Version 1.9.1; Wolodzko 2020), ggplot2 (Version 3.4.4.9000; Wickham 2016), htmltools (Version 0.5.7; Cheng et al. 2023), knitr (Version 1.45; Xie 2015), mediumr (Yutani 2021), purrr (Version 1.0.2; Henry and Wickham 2020), rcorpora (Version 2.0.0; Kazemi et al. 2018), RefManageR (Version 1.4.0; McLean 2017), rmarkdown (Version 2.25; Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020), shiny (Version 1.7.5.1; Chang et al. 2021), and stringr (Version 1.5.0; Wickham 2022) to generate this document.↩︎