Categories: stats / Tags: stan Bayesian r /

## The initial problem

I wrote what I thought it was the generative process for some modeling work, and it looked too common to not have a name, so I started to ask around in Twitter:

Prob q: Is there a name for a categorical distribution that whenever you get an outcome you remove it from the set?

— Bruno Nicenboim (@bruno_nicenboim) March 9, 2021

Ex. 1st draw: y ~ cat(p1,p2,p3,p4) ; y = 2

2nd draw y ~ cat(p1/k,p3/k,p4/k) (k = p1+p3+p4); y =1

3rd, y ~ categorical(p3/(p3+p4),p4/(p3+p4)); y =3

4th, it's y =4

One useful clue was this one:

If I understand you correctly, the situation is similar to a (literal) horse-race, as modeled here: https://t.co/TGGS1h4xxe

— Pavel Logačev (@pavellogacev) March 9, 2021

It turns out that the distribution that I was describing is in general used for rankings or ordered data and it is called an *exploded logit distribution*.^{1}

In this post, I’ll show how this model can be fit in the probabilistic programming language Stan, and how it can be used to describe the underlying order of ranking data.

I’m going to load some R packages that will be useful throughout this post.

```
library(dplyr) # Data manipulation
library(purrr) # List manipulation
library(ggplot2) # Nice plots
library(extraDistr) # More distributions
library(rcorpora) # Get random words
library(cmdstanr) # Lightweight Stan interface
library(bayesplot) # Nice Bayesian plots
```

## Ranking data

Ranking data appear when we care about the *underlying* order that certain elements have. We might want to know which are the best horses after looking at several races (Gakis et al. 2018), which is the best candidate for a job after a series of interviewers talked to several candidates. More in line with cognitive science, we might want to know which are the best possible completions for a sentence or the best exemplars of a category.

One way to get a ranking of exemplars of a category, for example, is to present them to participants and ask them to order all (or a subset) of them (see Barsalou 1985).

### A ranking simulation using pizza toppings

Let’s consider the following 25 pizza toppings:

```
toppings <- corpora("foods/pizzaToppings")$pizzaToppings
N_toppings <- length(toppings)
toppings
```

```
[1] "anchovies" "artichoke"
[3] "bacon" "breakfast bacon"
[5] "Canadian bacon" "cheese"
[7] "chicken" "chili peppers"
[9] "feta" "garlic"
[11] "green peppers" "grilled onions"
[13] "ground beef" "ham"
[15] "hot sauce" "meatballs"
[17] "mushrooms" "olives"
[19] "onions" "pepperoni"
[21] "pineapple" "sausage"
[23] "spinach" "sun-dried tomato"
[25] "tomatoes"
```

Let’s say that we want to know the underlying order of pizza toppings. For the modeling, I’m going to assume that the toppings are ordered according to an underlying value, which also represents how likely it is for each topping to be *the* exemplar of their category.

To get a known ground truth for the ranking, I’m going to simulate an order of pizza toppings. I assign probabilities that sum up to one to the toppings by drawing a random sample from a Dirichlet distribution. The Dirichlet distribution is the generalization of the Beta distribution. It has a concentration parameter, usually \(\boldsymbol{\alpha}\), which is a vector as long as the probabilities we are sampling (25 here). When the vector is full of ones, the distribution is uniform: All probabilities are equally likely, so on average each one is \(\frac{1}{vector \text{ } length}\) (\(\frac{1}{25}\) here). By setting all the concentration parameters below one (namely \(.2\)), I’m enforcing sparsity in the random values that I’m generating, that is, many probability values close to zero.

These is the true order that I’m assuming here:

```
# all elements of the vector are .5
alpha <- rep(.2, N_toppings)
# Generate one draw from a Dirichlet distribution
P_toppings <- c(rdirichlet(1, alpha)) %>%
# Add names
setNames(toppings) %>%
# Sort from the best exemplar
sort(decreasing = TRUE)
P_toppings %>%
round(3)
```

```
breakfast bacon chicken feta
0.294 0.241 0.087
anchovies sun-dried tomato olives
0.087 0.077 0.057
pepperoni artichoke cheese
0.056 0.049 0.010
Canadian bacon bacon ham
0.008 0.008 0.006
meatballs chili peppers garlic
0.004 0.004 0.004
ground beef tomatoes hot sauce
0.003 0.003 0.002
onions sausage pineapple
0.000 0.000 0.000
spinach mushrooms grilled onions
0.000 0.000 0.000
green peppers
0.000
```

- Given these values, if I were to ask a participant “What’s the most appropriate topping for a pizza?” I would assume that 29.37 percent of the time, I would get
*breakfast bacon*.

Essentially, we expect something like this to be happening.

\[\begin{equation} response \sim Categorical(\Theta_{toppings}) \tag{1} \end{equation}\]

With \(\Theta_{toppings}\) representing the different probabilities for each topping. The probability mass function of the categorical distribution is absurdly simple: It’s just the probability of the outcome.

\[\begin{equation} p(x = i) = \Theta_i \tag{2} \end{equation}\]

where \(i = \{\)breakfast bacon, chicken, feta, anchovies, sun-dried tomato, olives, pepperoni, artichoke, cheese, Canadian bacon, bacon, ham, meatballs, chili peppers, garlic, ground beef, tomatoes, hot sauce, onions, sausage, pineapple, spinach, mushrooms, grilled onions, green peppers\(\}\).

We can simulate this with 100 participants as follows:

`response <- rcat(100, P_toppings, names(P_toppings))`

And this should match approximately `P_toppings`

.

`table(response)/100`

```
response
breakfast bacon chicken feta
0.26 0.19 0.16
anchovies sun-dried tomato olives
0.07 0.15 0.06
pepperoni artichoke cheese
0.01 0.08 0.00
Canadian bacon bacon ham
0.02 0.00 0.00
meatballs chili peppers garlic
0.00 0.00 0.00
ground beef tomatoes hot sauce
0.00 0.00 0.00
onions sausage pineapple
0.00 0.00 0.00
spinach mushrooms grilled onions
0.00 0.00 0.00
green peppers
0.00
```

*It seems that by only asking participants to give the best topping we could already deduce the underlying order…*

True, but one motivation for considering ranking data is the amount of information that we gather with a list due to their combinatorial nature. If we ask participants to rank \(n\) items, an answer consists in making a single selection out of \(n!\) possibilities. Ordering 7 pizza toppings, for example, constitutes making a single selection out of 5040 possibilities!

If we don’t relay on lists and there is sparcity, it requires a large number of participants until we get answers of low probability. (For example, we’ll need a very large number of participants until we hear something else but *hammer* as an exemplar of tools).

- Now, what happens if we ask about the second most appropriate topping for a pizza?

Now we need to exclude the first topping that was given, and draw another sample from a categorical distribution. (We don’t allow the participant to repeat toppings, that is, to say that the best topping is pineapple and the second best is also pineapple). This means that now the probability of the topping already given is zero, and that we need to normalize our original probability values by dividing them by the new total probability (which will be lower than 1).

Here, the probability of getting the element \(j\) (where \(j \neq i\)) is

\[\begin{equation} p(x = j) = \frac{\Theta_j}{\sum \Theta_{[-i]}} \tag{3} \end{equation}\]

where \(\Theta_{[-i]}\) represents the probabilities of all the outcomes except of \(i\), which was the first one.

- We can go on with the third best topping, where we need to normalize the remaining probabilities by dividing by the new sum of probabilities.

\[\begin{equation} p(x = k) = \frac{\Theta_k}{\sum \Theta_{[-i,-j]}} \tag{4} \end{equation}\]

- We can do this until we get to the last element, which will be drawn with probability 1.

**And this is the exploded logit distribution.**

This process can be simulated in R as follows:

```
rexploded <- function(n, ranked = 3, prob, labels = NULL){
#run n times
lapply(1:n, function(nn){
res <- rep(NA,ranked)
if(!is.null(labels)){
res <- factor(res, labels)
} else {
# if there are no labels, just 1,2,3,...
labels <- seq_along(prob)
}
for(i in 1:ranked){
# normalize the probability so that it sums to 1
prob <- prob/sum(prob)
res[i] <- rcat(1, prob = prob, labels = labels)
# remove the choice from the set:
prob[res[i]] <- 0
}
res
})
}
```

If we would like to simulate 50 subjects creating a ranking of the best 7 toppings, we would do the following:

```
res <- rexploded(n = 50,
ranked = 7,
prob = P_toppings,
labels = names(P_toppings))
# subject 1:
res[[1]]
```

```
[1] sun-dried tomato artichoke olives
[4] breakfast bacon chicken pepperoni
[7] anchovies
25 Levels: breakfast bacon chicken feta ... green peppers
```

We have simulated ranking data of pizza toppings, can we recover the original probability values and “discover” the underlying order?

## Fitting the exploded logistic distribution in Stan

To fit the model in Stan, I’m going to create a custom probability mass function that takes an array of integers, `x`

, which represents a set of rankings, and a vector of probability values, `theta`

, that sums up to one.

The logic of this function is that the probability mass function of a ranking \(\{i,j,k, \ldots, N \}\) can be written as a product of normalized categorical distributions (where the first one is just divided by 1).

\[\begin{equation} p(x = \{i,j,k,\ldots\}) = \frac{\Theta_i}{\sum \Theta} \cdot \frac{\Theta_j}{\sum \Theta_{[-i]}} \cdot \frac{\Theta_k}{\sum \Theta_{[-i, -j]}} \ldots \tag{5} \end{equation}\]

For Stan, we need the log-PDF. In log-space, products become sums, and divisions differences, and the log of \(\sum \Theta\) will be zero:

\[\begin{equation} \begin{aligned} log(p(x = \{i,j,k,\ldots\})) =& \log(\Theta_i) - log(\sum \Theta) \\ & + \log(\Theta_j) - \log(\sum \Theta_{[-i]}) \\ &+ \log(\Theta_k) -\log(\sum \Theta_{[-i, -j]}) \\ & + \ldots \end{aligned} \tag{5} \end{equation}\]

The following Stan custom function follows this logic but iterating over the rankings. In each iteration, it aggregates in the variable `out`

the addends of the log probability mass function, and turns the probability of selecting again the already ranked element to zero.

```
real exploded_lpmf(int[] x, vector Theta){
real out = 0;
vector[num_elements(Theta)] thetar = Theta;
for(pos in x){
out += log(thetar[pos]) - log(sum(thetar));
thetar[pos] = 0;
}
return(out);
}
```

The whole model named `exploded.stan`

includes the usual data declaration, the parameter `Theta`

declared as a simplex (i.e., it sums to one), and a uniform Dirichlet prior for `Theta`

. (I’m assuming that I don’t know how sparse the probabilities are).

```
functions {
real exploded_lpmf(int[] x, vector Theta){
real out = 0;
vector[num_elements(Theta)] thetar = Theta;
for(pos in x){
out += log(thetar[pos]) - log(sum(thetar));
thetar[pos] = 0;
}
return(out);
}
}
data{
int N_ranking; //total times the choices were ranked
int N_ranked; //total choices ranked
int N_options; //total options
int res[N_ranking, N_ranked];
}
parameters {
simplex[N_options] Theta;
}
model {
target += dirichlet_lpdf(Theta| rep_vector(1, N_options));
for(r in 1:N_ranking){
target += exploded_lpmf(res[r]|Theta);
}
}
```

Let’s see if I can recover the parameter values.

```
# Makethe list of lists into a matrix
res_matrix <- t(sapply(res, as.numeric))
ldata <- list(res = res_matrix,
N_ranked = length(res[[1]]),
N_options = length(P_toppings),
N_ranking = length(res))
m_expl <- cmdstan_model("./exploded.stan")
f_exploded <- m_expl$sample(
data = ldata,
seed = 123,
parallel_chains = 4
)
f_exploded
```

```
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -724.91 -724.59 3.59 3.49 -731.39 -719.59 1.00 1262 2290
Theta[1] 0.29 0.29 0.04 0.04 0.23 0.35 1.00 5575 2538
Theta[2] 0.27 0.27 0.04 0.04 0.21 0.33 1.00 5320 3114
Theta[3] 0.08 0.08 0.01 0.01 0.06 0.10 1.00 6830 3167
Theta[4] 0.09 0.08 0.01 0.01 0.06 0.11 1.00 5790 2746
Theta[5] 0.07 0.07 0.01 0.01 0.05 0.09 1.00 6440 3142
Theta[6] 0.05 0.04 0.01 0.01 0.03 0.06 1.00 6608 3177
Theta[7] 0.04 0.04 0.01 0.01 0.03 0.06 1.00 6373 2697
Theta[8] 0.05 0.05 0.01 0.01 0.03 0.06 1.00 5620 2863
Theta[9] 0.01 0.01 0.00 0.00 0.00 0.01 1.00 7235 2664
# showing 10 of 26 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
```

I plot the posterior distributions of the probability values and the true probability values below.

```
mcmc_recover_hist(f_exploded$draws("Theta"),
P_toppings,
facet_args =
list(scales = "fixed", ncol = 3)) +
theme(legend.position="bottom")
```

It looks reasonable. However, if we really want to be sure that this is working, we should probably use simulation based calibration (Talts et al. 2018).

## What is this good for?

This super simple example shows how to get an underlying ranking based on a set of responses from a number of subjects. It’s straightforward to adapt this model to data from participants ranking elements from different sets of the *same size* (e.g., 7 out of 25 toppings, 7 out of 25 tools). It’s a little less straightforward if the sets are of different sizes, e.g., rank 7 toppings out 25, but 7 tools out 50. This is just because Stan doesn’t allow ragged arrays. See here some tips for implementing the latter model.

## Could this be used as a cognitive model of people’s rankings?

Maybe. And I enter here in the realm of half baked research, ideal for a blog post.

Lee, Steyvers, and Miller (2014) show the implementation of a cognitive model for rank order data from the latent knowledge of participants, which is based on Thurstonian models (Thurstone 1927, 1931) fitted with Bayesian methods in JAGS (Johnson and Kuhn 2013).

The exploded logit model seems to be closely related to the Thurstonian model. The Thurstonian model assumes that each participant assigns an underlying score to each item of a set, which is drawn from a true score with normally distributed error. The score determines the order that the participant gives. We can think about the exploded logit similarly. While I modeled the underlying ranking based on probability values, one could assume that each participant \(s\) had their own score \(mu_{is}\) for each item (or pizza topping) \(i\), which is built as a common score \(mu_i\) together with some individual deviation \(\epsilon_{is}\):

\[\begin{equation} \mu_{is} = \mu_i + \epsilon_{is} \tag{6} \end{equation}\]

If we assume that \(\epsilon_{is}\) has a Gumbel distribution, then the probability of \(\mu_{is}\) being ranked first out of N options is determined by a softmax function:

\[\begin{equation} P(i) = \frac{\exp(\mu_i)}{\sum \exp(\mu)} \tag{7} \end{equation}\]

where \(\mu\) is the vector of scores for all elements of the set.

And the probability of ordering \(j\) second is:

\[\begin{equation} P(i,j,\ldots) = \frac{\exp(\mu_j)}{\sum \exp(\mu_{[-i]} )} \tag{8} \end{equation}\]

and so forth.

These last equations are essentially the same categorical distributions that I used before in (2) and (3), but the softmax function converts the unbounded scores into probabilities first. However, with the exploded logit, the error term goes away leading to a more tractable model. This is not the case for the Thurstonian model. The Thurstonian model is more complex, but at the same time we gain more flexibility. With the error term, the Thurstonian model can incorporate the reliability of the participants’ judgments and even correlations, which, as far as I know, can’t be included in the exploded logit model.

## Session info^{2}

`sessionInfo()`

```
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=nl_NL.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=nl_NL.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bayesplot_1.8.1 cmdstanr_0.4.0.9000 rcorpora_2.0.0 extraDistr_1.9.1 ggplot2_3.3.5 purrr_0.3.4 dplyr_1.0.7
loaded via a namespace (and not attached):
[1] tidyselect_1.1.1 xfun_0.26 bslib_0.3.0 reshape2_1.4.4 colorspace_2.0-2 vctrs_0.3.8 generics_0.1.1
[8] htmltools_0.5.2 yaml_2.2.1 utf8_1.2.2 rlang_0.4.12 jquerylib_0.1.4 pillar_1.6.4 glue_1.5.1
[15] withr_2.4.2 DBI_1.1.1 distributional_0.2.2 matrixStats_0.60.1 lifecycle_1.0.1 plyr_1.8.6 stringr_1.4.0
[22] posterior_1.1.0 munsell_0.5.0 blogdown_1.5.2 gtable_0.3.0 evaluate_0.14 labeling_0.4.2 papaja_0.1.0.9997
[29] knitr_1.36 fastmap_1.1.0 ps_1.6.0 fansi_0.5.0 highr_0.9 Rcpp_1.0.7 scales_1.1.1
[36] backports_1.2.1 checkmate_2.0.0 jsonlite_1.7.2 abind_1.4-5 farver_2.1.0 tensorA_0.36.2 digest_0.6.28
[43] stringi_1.7.6 processx_3.5.2 bookdown_0.22 grid_4.1.1 tools_4.1.1 magrittr_2.0.1 sass_0.4.0
[50] tibble_3.1.5 crayon_1.4.2 pkgconfig_2.0.3 ellipsis_0.3.2 data.table_1.14.2 ggridges_0.5.3 assertthat_0.2.1
[57] rmarkdown_2.11 R6_2.5.1 compiler_4.1.1
```

## References

*Journal of Experimental Psychology: Learning, Memory, and Cognition*11 (4): 629.

*Journal of Econometrics*17 (1): 1–19. https://doi.org/https://doi.org/10.1016/0304-4076(81)90056-7.

*Shiny: Web Application Framework for r*. https://CRAN.R-project.org/package=shiny.

*Cmdstanr: R Interface to ’CmdStan’*.

*J. R. Stat. Soc. A*182: 389–402. https://doi.org/10.1111/rssa.12378.

*Athens Journal of Sports*5 (2): 95–114.

*Purrr: Functional Programming Tools*. https://CRAN.R-project.org/package=purrr.

*Behavior Research Methods*45 (3): 857–72. https://doi.org/10.3758/s13428-012-0300-3.

*Rcorpora: A Collection of Small Text Corpora of Interesting Data*. https://CRAN.R-project.org/package=rcorpora.

*PLOS ONE*9 (5): e96431. https://doi.org/10.1371/journal.pone.0096431.

*Individual Choice Behavior : A Theoretical Analysis*. Book. Wiley N.Y.

*Journal of the Royal Statistical Society. Series C (Applied Statistics)*24 (2): 193–202. http://www.jstor.org/stable/2346567.

*R: A Language and Environment for Statistical Computing*. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

*Psychological Reviews*34: 273–86.

*Journal of Experimental Psychology*14 (3): 187.

*Ggplot2: Elegant Graphics for Data Analysis*. Springer-Verlag New York. https://ggplot2.tidyverse.org.

*Dplyr: A Grammar of Data Manipulation*. https://CRAN.R-project.org/package=dplyr.

*extraDistr: Additional Univariate and Multivariate Distributions*. https://CRAN.R-project.org/package=extraDistr.

*Mediumr: R Interface to ’Medium’ API*.

This model is also called the

*rank ordered logit model*(Beggs, Cardell, and Hausman 1981) or Plackett–Luce model due to Plackett (1975) and Luce (1959), but I liked the explosion part more.↩︎I used R [Version 4.1.1; R Core Team (2020)] and the R-packages

*bayesplot*[Version 1.8.1; Gabry et al. (2019)],*cmdstanr*[Version 0.4.0.9000; Gabry and Češnovar (2020)],*dplyr*[Version 1.0.7; Wickham et al. (2021)],*extraDistr*[Version 1.9.1; Wolodzko (2020)],*ggplot2*[Version 3.3.5; Wickham (2016)],*mediumr*(Yutani 2021),*purrr*[Version 0.3.4; Henry and Wickham (2020)],*rcorpora*[Version 2.0.0; Kazemi et al. (2018)], and*shiny*[Version 1.6.0; Chang et al. (2021)] to generate this document.↩︎

### Citation:

For attribution, please cite this work as

```
Bruno Nicenboim. (2021). "A simple way to model rankings with Stan". Retrieved from /2021/03/21/a-simple-way-to-model-rankings-with-stan/.
```

BibTex citation

#####
@misc{Bruno Nicenboim,
author = { Bruno Nicenboim },
title = {A simple way to model rankings with Stan},
howpublished = {\url{/2021/03/21/a-simple-way-to-model-rankings-with-stan/}},
year = {2021}
}

@misc{Bruno Nicenboim, author = { Bruno Nicenboim }, title = {A simple way to model rankings with Stan}, howpublished = {\url{/2021/03/21/a-simple-way-to-model-rankings-with-stan/}}, year = {2021} }