A simple way to model rankings with Stan

Bruno Nicenboim / 2021-03-21 15 min read
Categories: stats / Tags: stan Bayesian r /

Update Feb 1st, 2024: I updated the Stan syntax.

The initial problem

I wrote down what I thought was the generative process for some modeling work, and it looked too common not to have a name, so I started to ask around on Twitter.

It turns out that the distribution I was describing is generally used for rankings or ordered data, and it is called an exploded logit distribution.1

In this post, I’ll show how this model can be fit in the probabilistic programming language Stan, and how it can be used to describe the underlying order of ranking data.

I’m going to load some R packages that will be useful throughout this post.

library(dplyr) # Data manipulation
library(purrr) # List manipulation
library(ggplot2) # Nice plots
library(extraDistr) # More distributions
library(rcorpora) # Get random words
library(cmdstanr) # Lightweight Stan interface
library(bayesplot) # Nice Bayesian plots

Ranking data

Ranking data appear when we care about the underlying order of certain elements. We might want to know which horses are the best after looking at several races (Gakis et al. 2018), or which candidate is the best for a job after a series of interviews. More in line with cognitive science, we might want to know which are the best possible completions for a sentence, or the best exemplars of a category.

One way to get a ranking of exemplars of a category, for example, is to present them to participants and ask them to order all (or a subset) of them (see Barsalou 1985).

A ranking simulation using pizza toppings

Let’s consider the following 25 pizza toppings:

toppings <- corpora("foods/pizzaToppings")$pizzaToppings
N_toppings <- length(toppings)
toppings
 [1] "anchovies"        "artichoke"       
 [3] "bacon"            "breakfast bacon" 
 [5] "Canadian bacon"   "cheese"          
 [7] "chicken"          "chili peppers"   
 [9] "feta"             "garlic"          
[11] "green peppers"    "grilled onions"  
[13] "ground beef"      "ham"             
[15] "hot sauce"        "meatballs"       
[17] "mushrooms"        "olives"          
[19] "onions"           "pepperoni"       
[21] "pineapple"        "sausage"         
[23] "spinach"          "sun-dried tomato"
[25] "tomatoes"        

Let’s say that we want to know the underlying order of pizza toppings. For the modeling, I’m going to assume that the toppings are ordered according to an underlying value, which also represents how likely each topping is to be the best exemplar of its category.

To get a known ground truth for the ranking, I’m going to simulate an order of pizza toppings. I assign probabilities that sum up to one to the toppings by drawing a random sample from a Dirichlet distribution. The Dirichlet distribution is the generalization of the Beta distribution to multiple dimensions. It has a concentration parameter, usually \(\boldsymbol{\alpha}\), a vector as long as the probabilities we are sampling (25 here). When the vector is full of ones, the distribution is uniform: All probabilities are equally likely, so on average each one is \(\frac{1}{\text{vector length}}\) (\(\frac{1}{25}\) here). By setting all the concentration parameters below one (namely \(.2\)), I’m enforcing sparsity in the random values that I’m generating, that is, many probability values close to zero.
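As a quick illustration of the two regimes (this check is my addition, not part of the original simulation):

set.seed(42)
# concentration 1: on average every topping gets about 1/25 = .04
round(colMeans(rdirichlet(1000, rep(1, N_toppings))), 2)
# concentration .2: a single draw is sparse, with many values near zero
round(c(rdirichlet(1, rep(.2, N_toppings))), 2)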

This is the true order that I’m assuming here:

# all elements of the vector are .2
alpha <- rep(.2, N_toppings)
# Generate one draw from a Dirichlet distribution
P_toppings <- c(rdirichlet(1, alpha)) %>%
  # Add names
  setNames(toppings) %>%
  # Sort from the best exemplar
  sort(decreasing = TRUE)
P_toppings %>%
  round(3)
       artichoke  breakfast bacon           onions 
           0.387            0.140            0.124 
     ground beef        hot sauce          spinach 
           0.091            0.090            0.072 
sun-dried tomato        anchovies             feta 
           0.036            0.016            0.010 
   green peppers            bacon   Canadian bacon 
           0.006            0.006            0.006 
   chili peppers        pepperoni           garlic 
           0.005            0.005            0.003 
        tomatoes              ham        mushrooms 
           0.002            0.001            0.001 
       meatballs          sausage           olives 
           0.000            0.000            0.000 
  grilled onions          chicken           cheese 
           0.000            0.000            0.000 
       pineapple 
           0.000 
  • Given these values, if I were to ask a participant “What’s the most appropriate topping for a pizza?” I would assume that 38.74 percent of the time, I would get artichoke.

Essentially, we expect something like this to be happening.

\[\begin{equation} response \sim Categorical(\Theta_{toppings}) \tag{1} \end{equation}\]

With \(\Theta_{toppings}\) representing the different probabilities for each topping. The probability mass function of the categorical distribution is absurdly simple: It’s just the probability of the outcome.

\[\begin{equation} p(x = i) = \Theta_i \tag{2} \end{equation}\]

where \(i = \{\)artichoke, breakfast bacon, onions, ground beef, hot sauce, spinach, sun-dried tomato, anchovies, feta, green peppers, bacon, Canadian bacon, chili peppers, pepperoni, garlic, tomatoes, ham, mushrooms, meatballs, sausage, olives, grilled onions, chicken, cheese, pineapple\(\}\).
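For instance, given the simulated values above, the probability that a single response is artichoke is just its entry in P_toppings:

round(P_toppings["artichoke"], 3)
artichoke 
    0.387 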

We can simulate this with 100 participants as follows:

response <- rcat(100, P_toppings, names(P_toppings))

And the observed frequencies should approximately match P_toppings.

table(response)/100
response
       artichoke  breakfast bacon           onions 
            0.31             0.12             0.17 
     ground beef        hot sauce          spinach 
            0.11             0.14             0.04 
sun-dried tomato        anchovies             feta 
            0.07             0.02             0.02 
   green peppers            bacon   Canadian bacon 
            0.00             0.00             0.00 
   chili peppers        pepperoni           garlic 
            0.00             0.00             0.00 
        tomatoes              ham        mushrooms 
            0.00             0.00             0.00 
       meatballs          sausage           olives 
            0.00             0.00             0.00 
  grilled onions          chicken           cheese 
            0.00             0.00             0.00 
       pineapple 
            0.00 

It seems that by only asking participants to give the best topping we could already deduce the underlying order…

True, but one motivation for considering ranking data is the amount of information that a list carries due to its combinatorial nature. If we ask participants to rank \(n\) items, an answer consists of a single selection out of \(n!\) possibilities. Ordering 7 pizza toppings, for example, constitutes a single selection out of 5040 possibilities!
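To put numbers on this (a quick check of my own): ranking 7 items among themselves allows \(7! = 5040\) orderings, while ranking 7 out of all 25 toppings, as in the simulation below, allows many more ordered selections.

factorial(7)
[1] 5040
# 25 * 24 * ... * 19 ordered ways to rank 7 out of 25 toppings
prod(25:19)
[1] 2422728000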

If we don’t rely on lists and there is sparsity, we need a very large number of participants before we observe the low-probability answers. (For example, we would need a very large number of participants before we hear anything other than hammer as an exemplar of tools.)

  • Now, what happens if we ask about the second most appropriate topping for a pizza?

Now we need to exclude the topping that was already given and draw another sample from a categorical distribution. (We don’t allow the participant to repeat toppings, that is, to say that the best topping is pineapple and the second best is also pineapple.) This means that the probability of the topping already given is now zero, and that we need to renormalize the remaining probability values by dividing them by their new total (which will be lower than 1).

Here, the probability of getting the element \(j\) (where \(j \neq i\)) is

\[\begin{equation} p(x = j) = \frac{\Theta_j}{\sum \Theta_{[-i]}} \tag{3} \end{equation}\]

where \(\Theta_{[-i]}\) represents the probabilities of all the outcomes except \(i\), which was the first one.
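As a concrete example (my addition): if artichoke, the most likely topping, was ranked first, the remaining probabilities are divided by the leftover mass, so they all grow proportionally.

p_rest <- P_toppings[names(P_toppings) != "artichoke"]
# breakfast bacon rises from roughly .14 to roughly .23:
round(head(p_rest / sum(p_rest), 3), 3)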

  • We can go on with the third best topping, where we need to normalize the remaining probabilities by dividing by the new sum of probabilities.

\[\begin{equation} p(x = k) = \frac{\Theta_k}{\sum \Theta_{[-i,-j]}} \tag{4} \end{equation}\]

  • We can do this until we get to the last element, which will be drawn with probability 1.

And this is the exploded logit distribution.
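A tiny worked example with made-up values: for \(\Theta = (.5, .3, .2)\), the probability of the complete ranking \(\{2, 1, 3\}\) is \(.3 \cdot \frac{.5}{.7} \cdot \frac{.2}{.2} = \frac{3}{14} \approx .214\).

theta <- c(.5, .3, .2)
# second item first, then the first item, then the last one (probability 1)
theta[2] * (theta[1] / sum(theta[c(1, 3)])) * (theta[3] / theta[3])
[1] 0.2142857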

This process can be simulated in R as follows:

rexploded <- function(n, ranked = 3, prob, labels = NULL){
  # run n times
  lapply(1:n, function(nn){
    res <- rep(NA, ranked)
    if(!is.null(labels)){
      res <- factor(res, labels)
    } else {
      # if there are no labels, just 1, 2, 3, ...
      labels <- seq_along(prob)
    }
    for(i in 1:ranked){
      # renormalize the probabilities so that they sum to 1
      prob <- prob/sum(prob)
      res[i] <- rcat(1, prob = prob, labels = labels)
      # remove the choice from the set:
      prob[res[i]] <- 0
    }
    res
  })
}

If we want to simulate 50 subjects each ranking their best 7 toppings, we do the following:

res <- rexploded(n = 50,
                 ranked = 7,
                 prob = P_toppings,
                 labels = names(P_toppings))
# subject 1:
res[[1]]
[1] hot sauce        artichoke        breakfast bacon 
[4] onions           ground beef      sun-dried tomato
[7] spinach         
25 Levels: artichoke breakfast bacon onions ... pineapple
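As a quick sanity check (my addition), the relative frequencies of the toppings ranked first should again approximate the largest values of P_toppings:

first_choices <- sapply(res, function(x) as.character(x[1]))
round(sort(table(first_choices) / length(res), decreasing = TRUE), 2)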

We have simulated ranking data of pizza toppings. Can we recover the original probability values and “discover” the underlying order?

Fitting the exploded logistic distribution in Stan

To fit the model in Stan, I’m going to create a custom probability mass function that takes an array of integers, x, which represents a set of rankings, and a vector of probability values, Theta, which sums to one.

The logic of this function is that the probability mass function of a ranking \(\{i,j,k, \ldots, N \}\) can be written as a product of normalized categorical distributions (where the first one is just divided by 1).

\[\begin{equation} p(x = \{i,j,k,\ldots\}) = \frac{\Theta_i}{\sum \Theta} \cdot \frac{\Theta_j}{\sum \Theta_{[-i]}} \cdot \frac{\Theta_k}{\sum \Theta_{[-i, -j]}} \ldots \tag{5} \end{equation}\]

For Stan, we need the log-PMF. In log-space, products become sums and divisions become differences, and the log of \(\sum \Theta\) (the sum over the full simplex, which is 1) is zero:

\[\begin{equation} \begin{aligned} \log(p(x = \{i,j,k,\ldots\})) =& \log(\Theta_i) - \log(\sum \Theta) \\ & + \log(\Theta_j) - \log(\sum \Theta_{[-i]}) \\ &+ \log(\Theta_k) - \log(\sum \Theta_{[-i, -j]}) \\ & + \ldots \end{aligned} \tag{5} \end{equation}\]
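Before moving to Stan, a direct R transcription of this log-PMF might be useful (the function name dexploded_log and its interface are mine, not from the original post):

dexploded_log <- function(ranking, theta) {
  lp <- 0
  for (pos in ranking) {
    lp <- lp + log(theta[pos]) - log(sum(theta))
    # a ranked item can't be chosen again:
    theta[pos] <- 0
  }
  lp
}
# log probability of subject 1's simulated ranking:
dexploded_log(as.numeric(res[[1]]), P_toppings)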

The following Stan custom function follows this logic, iterating over the rankings. In each iteration, it accumulates the terms of the log probability mass function in the variable out, and then sets the probability of the just-ranked element to zero so that it cannot be selected again.

real exploded_lpmf(array[] int x, vector Theta){
  real out = 0;
  vector[num_elements(Theta)] thetar = Theta;
  for(pos in x){
    out += log(thetar[pos]) - log(sum(thetar));
    thetar[pos] = 0;
  }
  return(out);
}

The whole model, named exploded.stan, includes the usual data declaration, the parameter Theta declared as a simplex (i.e., it sums to one), and a uniform Dirichlet prior on Theta. (I’m assuming that I don’t know how sparse the probabilities are.)

functions {
  real exploded_lpmf(array[] int x, vector Theta){
    real out = 0;
    vector[num_elements(Theta)] thetar = Theta;
    for(pos in x){
      out += log(thetar[pos]) - log(sum(thetar));
      thetar[pos] = 0;
    }
    return(out);
  }
}
data {
  int N_ranking; //total times the choices were ranked
  int N_ranked; //total choices ranked
  int N_options; //total options
  array[N_ranking, N_ranked] int res;
}
parameters {
  simplex[N_options] Theta;
}
model {
  target += dirichlet_lpdf(Theta | rep_vector(1, N_options));
  for(r in 1:N_ranking){
    target += exploded_lpmf(res[r] | Theta);
  }
}

Let’s see if I can recover the parameter values.

# Make the list of rankings into a matrix
res_matrix <- t(sapply(res, as.numeric))
ldata <- list(res = res_matrix, 
              N_ranked = length(res[[1]]), 
              N_options = length(P_toppings), 
              N_ranking = length(res)) 

m_expl <- cmdstan_model("./exploded.stan")

f_exploded <- m_expl$sample(
  data = ldata,
  seed = 123,
  parallel_chains = 4
)

f_exploded
 variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
 lp__     -688.83 -688.43 3.67 3.67 -695.50 -683.60 1.00     1381     2171
 Theta[1]    0.35    0.35 0.04 0.04    0.28    0.42 1.00     5554     2867
 Theta[2]    0.16    0.16 0.02 0.02    0.13    0.20 1.00     5578     2692
 Theta[3]    0.13    0.13 0.02 0.02    0.10    0.17 1.00     5535     2966
 Theta[4]    0.08    0.08 0.01 0.01    0.06    0.11 1.00     5656     3210
 Theta[5]    0.10    0.10 0.01 0.01    0.08    0.12 1.00     6052     2845
 Theta[6]    0.06    0.06 0.01 0.01    0.04    0.08 1.00     5893     3232
 Theta[7]    0.04    0.04 0.01 0.01    0.03    0.05 1.00     5861     3159
 Theta[8]    0.02    0.02 0.00 0.00    0.01    0.03 1.00     6491     3193
 Theta[9]    0.01    0.01 0.00 0.00    0.00    0.01 1.00     5362     2888

 # showing 10 of 26 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)

I plot the posterior distributions of the probability values and the true probability values below.

mcmc_recover_hist(f_exploded$draws("Theta"),
                  P_toppings,
                  facet_args =
                    list(scales = "fixed", ncol = 3)) +
  theme(legend.position="bottom")

It looks reasonable. However, if we really want to be sure that this works, we should probably use simulation-based calibration (Talts et al. 2018).

What is this good for?

This super simple example shows how to get an underlying ranking based on a set of responses from a number of subjects. It’s straightforward to adapt this model to data from participants ranking elements from different sets of the same size (e.g., 7 out of 25 toppings, 7 out of 25 tools). It’s a little less straightforward if the sets are of different sizes, e.g., ranking 7 toppings out of 25 but 7 tools out of 50, simply because Stan doesn’t allow ragged arrays. See here for some tips on implementing the latter model; a padding workaround is sketched below.
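A minimal sketch of the padding workaround, assuming hypothetical rankings of different lengths (nothing here comes from the original post):

# pad each ranking with zeros up to the longest length, and pass the true
# number of ranked items per row, so that the Stan loop can stop early
# (e.g., for(i in 1:n_ranked[r]) instead of looping over the full row)
pad_ranking <- function(r, max_len) {
  c(as.numeric(r), rep(0, max_len - length(r)))
}
rankings <- list(sample(25, 7), sample(50, 5)) # hypothetical ragged data
max_len <- max(lengths(rankings))
res_padded <- t(sapply(rankings, pad_ranking, max_len = max_len))
n_ranked <- lengths(rankings) # goes into the Stan data list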

Could this be used as a cognitive model of people’s rankings?

Maybe. And here I enter the realm of half-baked research, ideal for a blog post.

Lee, Steyvers, and Miller (2014) present the implementation of a cognitive model for rank-order data based on the latent knowledge of participants, which builds on Thurstonian models (Thurstone 1927, 1931) fitted with Bayesian methods in JAGS (Johnson and Kuhn 2013).

The exploded logit model seems to be closely related to the Thurstonian model. The Thurstonian model assumes that each participant assigns an underlying score to each item of a set, drawn from a true score with normally distributed error; these scores determine the order that the participant gives. We can think about the exploded logit similarly. While I modeled the underlying ranking based on probability values, one could assume that each participant \(s\) has their own score \(\mu_{is}\) for each item (or pizza topping) \(i\), built from a common score \(\mu_i\) together with some individual deviation \(\epsilon_{is}\):

\[\begin{equation} \mu_{is} = \mu_i + \epsilon_{is} \tag{6} \end{equation}\]

If we assume that \(\epsilon_{is}\) has a Gumbel distribution, then the probability of item \(i\) being ranked first out of \(N\) options is given by the softmax function:

\[\begin{equation} P(i) = \frac{\exp(\mu_i)}{\sum \exp(\mu)} \tag{7} \end{equation}\]

where \(\mu\) is the vector of scores for all elements of the set.

And the probability of ranking \(j\) second, given that \(i\) was ranked first, is:

\[\begin{equation} P(j \mid i) = \frac{\exp(\mu_j)}{\sum \exp(\mu_{[-i]})} \tag{8} \end{equation}\]

and so forth.
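In R, the softmax step looks as follows (the scores are made up for illustration):

softmax <- function(x) exp(x) / sum(exp(x))
mu <- c(2, 1, 0, -1)
round(softmax(mu), 3)
[1] 0.644 0.237 0.087 0.032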

These last equations are essentially the same categorical distributions that I used before in (2) and (3), except that the softmax function first converts the unbounded scores into probabilities. With the exploded logit, the error term drops out, leading to a more tractable model; this is not the case for the Thurstonian model. The Thurstonian model is more complex, but in exchange it is more flexible: through the error term it can incorporate the reliability of the participants’ judgments and even correlations between them, which, as far as I know, can’t be included in the exploded logit model.

Session info 2

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bayesplot_1.10.0   cmdstanr_0.7.1     rcorpora_2.0.0     extraDistr_1.9.1   ggplot2_3.4.4.9000 purrr_1.0.2        dplyr_1.1.3       

loaded via a namespace (and not attached):
 [1] gtable_0.3.4         tensorA_0.36.2.1     xfun_0.41            bslib_0.5.1          processx_3.8.3       papaja_0.1.2         vctrs_0.6.5         
 [8] tools_4.3.2          ps_1.7.6             generics_0.1.3       tibble_3.2.1         fansi_1.0.6          highr_0.10           RefManageR_1.4.0    
[15] pkgconfig_2.0.3      data.table_1.14.10   checkmate_2.3.1      distributional_0.3.2 lifecycle_1.0.4      compiler_4.3.2       farver_2.1.1        
[22] stringr_1.5.0        munsell_0.5.0        tinylabels_0.2.4     httpuv_1.6.12        htmltools_0.5.7      sass_0.4.7           yaml_2.3.7          
[29] later_1.3.1          pillar_1.9.0         jquerylib_0.1.4      ellipsis_0.3.2       cachem_1.0.8         abind_1.4-5          mime_0.12           
[36] posterior_1.5.0      tidyselect_1.2.0     digest_0.6.34        stringi_1.8.1        reshape2_1.4.4       bookdown_0.36        labeling_0.4.3      
[43] bibtex_0.5.1         fastmap_1.1.1        grid_4.3.2           colorspace_2.1-0     cli_3.6.2            magrittr_2.0.3       utf8_1.2.4          
[50] withr_3.0.0          promises_1.2.1       scales_1.3.0         backports_1.4.1      lubridate_1.9.3      timechange_0.2.0     rmarkdown_2.25      
[57] httr_1.4.7           matrixStats_1.2.0    blogdown_1.18        shiny_1.7.5.1        evaluate_0.23        knitr_1.45           rlang_1.1.3         
[64] Rcpp_1.0.12          xtable_1.8-4         glue_1.7.0           xml2_1.3.5           rstudioapi_0.15.0    jsonlite_1.8.8       R6_2.5.1            
[71] plyr_1.8.9          

References

Barsalou, Lawrence W. 1985. “Ideals, Central Tendency, and Frequency of Instantiation as Determinants of Graded Structure in Categories.” Journal of Experimental Psychology: Learning, Memory, and Cognition 11 (4): 629.
Beggs, S, S Cardell, and J Hausman. 1981. “Assessing the Potential Demand for Electric Cars.” Journal of Econometrics 17 (1): 1–19. https://doi.org/10.1016/0304-4076(81)90056-7.
Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for r. https://CRAN.R-project.org/package=shiny.
Cheng, Joe, Carson Sievert, Barret Schloerke, Winston Chang, Yihui Xie, and Jeff Allen. 2023. Htmltools: Tools for HTML. https://CRAN.R-project.org/package=htmltools.
Gabry, Jonah, and Rok Češnovar. 2020. Cmdstanr: R Interface to ’CmdStan’.
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” J. R. Stat. Soc. A 182: 389–402. https://doi.org/10.1111/rssa.12378.
Gakis, Konstantinos, Panos Pardalos, Chang-Hwan Choi, Jae-Hyeon Park, and Jiwun Yoon. 2018. “Simulation of a Probabilistic Model for Multi-Contestant Races.” Athens Journal of Sports 5 (2): 95–114.
Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
Johnson, Timothy R., and Kristine M. Kuhn. 2013. “Bayesian Thurstonian Models for Ranking Data Using JAGS.” Behavior Research Methods 45 (3): 857–72. https://doi.org/10.3758/s13428-012-0300-3.
Kazemi, Darius, Cole Willsea, Serin Delaunay, Karl Swedberg, Matthew Rothenberg, Greg Kennedy, Nathaniel Mitchell, et al. 2018. Rcorpora: A Collection of Small Text Corpora of Interesting Data. https://CRAN.R-project.org/package=rcorpora.
Lee, Michael D., Mark Steyvers, and Brent Miller. 2014. “A Cognitive Model for Aggregating People’s Rankings.” PLOS ONE 9 (5): e96431. https://doi.org/10.1371/journal.pone.0096431.
Luce, R. Duncan. 1959. Individual Choice Behavior : A Theoretical Analysis. Book. Wiley N.Y.
McLean, Mathew William. 2017. “RefManageR: Import and Manage BibTeX and BibLaTeX References in r.” The Journal of Open Source Software. https://doi.org/10.21105/joss.00338.
Plackett, R. L. 1975. “The Analysis of Permutations.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 24 (2): 193–202. http://www.jstor.org/stable/2346567.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration,” April. http://arxiv.org/abs/1804.06788.
Thurstone, Louis L. 1927. “A Law of Comparative Judgment.” Psychological Review 34: 273–86.
———. 1931. “Rank Order as a Psycho-Physical Method.” Journal of Experimental Psychology 14 (3): 187.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2022. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wolodzko, Tymoteusz. 2020. extraDistr: Additional Univariate and Multivariate Distributions. https://CRAN.R-project.org/package=extraDistr.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.
Yutani, Hiroaki. 2021. Mediumr: R Interface to ’Medium’ API.

  1. This model is also called the rank ordered logit model (Beggs, Cardell, and Hausman 1981) or Plackett–Luce model due to Plackett (1975) and Luce (1959), but I liked the explosion part more.↩︎

  2. I used R (Version 4.3.2; R Core Team 2020) and the R-packages bayesplot (Version 1.10.0; Gabry et al. 2019), cmdstanr (Version 0.7.1; Gabry and Češnovar 2020), dplyr (Version 1.1.3; Wickham et al. 2021), extraDistr (Version 1.9.1; Wolodzko 2020), ggplot2 (Version 3.4.4.9000; Wickham 2016), htmltools (Version 0.5.7; Cheng et al. 2023), knitr (Version 1.45; Xie 2015), mediumr (Yutani 2021), purrr (Version 1.0.2; Henry and Wickham 2020), rcorpora (Version 2.0.0; Kazemi et al. 2018), RefManageR (Version 1.4.0; McLean 2017), rmarkdown (Version 2.25; Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020), shiny (Version 1.7.5.1; Chang et al. 2021), and stringr (Version 1.5.0; Wickham 2022) to generate this document.↩︎