Using a GPT2 transformer model to get word predictability
Source: vignettes/articles/intro-gpt2.Rmd
Transformer models are a type of neural network architecture used for natural language processing tasks such as language translation and text generation. They were introduced in the Vaswani et al. (2017) paper “Attention Is All You Need”.
Large Language Models (LLMs) are a specific type of pre-trained transformer model. These models have been trained on massive amounts of text data and can be fine-tuned to perform a variety of NLP tasks such as text classification, named entity recognition, and question answering.
A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation that can predict the next word (or, more accurately, the next token) based on a preceding context. GPT-2 (Generative Pre-trained Transformer 2), developed by OpenAI, is an example of a causal language model (see also Radford et al. 2019).
One interesting side effect of causal language models is that the (log) probability of a word given a certain context can be extracted from them.
Load the following packages first:
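(The setup chunk is not reproduced here. Judging from the functions used in the rest of this article, it presumably loads pangoling itself together with tidytable and tictoc; treat the following as a sketch of that setup rather than the original chunk.)
library(pangoling) # causal_*() and tokenize_lst() functions
library(tidytable) # mutate(), filter(), map_dfr(), etc.
library(tictoc)    # tic() and toc() for timing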
Then let’s examine which continuation GPT-2 predicts following a specific context. Hugging Face provides access to pre-trained models, including freely available versions of GPT-2 in different sizes. The function causal_next_tokens_pred_tbl() will, by default, use the smallest version of GPT-2, but this can be modified with the argument model.
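For example, a larger GPT-2 variant can be requested by name. (This is only a sketch: "gpt2-medium" is assumed to be the corresponding Hugging Face identifier, and larger models take noticeably longer to download and run.)
causal_next_tokens_pred_tbl("The apple doesn't fall far from the",
  model = "gpt2-medium" # assumed identifier of the medium-sized GPT-2
)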
Let’s see what GPT-2 predicts following “The apple doesn’t fall far from the”.
tic()
(df_pred <- causal_next_tokens_pred_tbl("The apple doesn't fall far from the"))
#> Processing using causal model 'gpt2/' ...
#> # A tidytable: 50,257 × 2
#> token pred
#> <chr> <dbl>
#> 1 Ġtree -0.281
#> 2 Ġtrees -3.60
#> 3 Ġapple -4.29
#> 4 Ġtable -4.50
#> 5 Ġhead -4.83
#> 6 Ġmark -4.86
#> 7 Ġcake -4.91
#> 8 Ġground -5.08
#> 9 Ġtruth -5.31
#> 10 Ġtop -5.36
#> # ℹ 50,247 more rows
toc()
#> 4.31 sec elapsed
(The pretrained models and tokenizers will be downloaded from https://huggingface.co/ the first time they are used.)
The most likely continuation is “tree”, which makes sense. The first time a model is run, it will download some files that will be available for subsequent runs. However, every time we start a new R session and run a model, it will take some time to load it into memory; later runs in the same session are much faster. We can also preload a model with causal_preload().
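For instance, the default model could be preloaded explicitly at the start of a session (a sketch, assuming "gpt2" is the identifier of the default model used above):
causal_preload("gpt2") # load the model into memory so later calls skip this step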
tic()
(df_pred <- causal_next_tokens_pred_tbl("The apple doesn't fall far from the"))
#> Processing using causal model 'gpt2/' ...
#> # A tidytable: 50,257 × 2
#> token pred
#> <chr> <dbl>
#> 1 Ġtree -0.281
#> 2 Ġtrees -3.60
#> 3 Ġapple -4.29
#> 4 Ġtable -4.50
#> 5 Ġhead -4.83
#> 6 Ġmark -4.86
#> 7 Ġcake -4.91
#> 8 Ġground -5.08
#> 9 Ġtruth -5.31
#> 10 Ġtop -5.36
#> # ℹ 50,247 more rows
toc()
#> 0.23 sec elapsed
Notice that the predicted tokens (that is, the way GPT-2 interprets words) start with Ġ; this indicates that they are not the first word of a sentence. In fact, this is the way GPT-2 interprets our context:
tokenize_lst("The apple doesn't fall far from the")
#> [[1]]
#> [1] "The" "Ġapple" "Ġdoesn" "'t" "Ġfall" "Ġfar" "Ġfrom" "Ġthe"
Also notice that the GPT-2 tokenizer treats initial tokens differently from tokens that follow a space. A space in a token is indicated with “Ġ”.
tokenize_lst("This is different from This")
#> [[1]]
#> [1] "This" "Ġis" "Ġdifferent" "Ġfrom" "ĠThis"
It’s also possible to decode the tokens to get “pure” text:
tokenize_lst("This is different from This", decode = TRUE)
#> [[1]]
#> [1] "This" " is" " different" " from" " This"
Going back to the initial example: because causal_next_tokens_pred_tbl() returns natural log-probabilities by default, if we exponentiate them and sum them, we should get 1:
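(The original check is not reproduced here; a minimal sketch of it, using the df_pred table computed above, would be the following.)
sum(exp(df_pred$pred)) # should be very close to, but not exactly, 1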
Because of approximation errors, this is not exactly one.
When doing tests, sshleifer/tiny-gpt2 is quite useful: since it’s a tiny model, it’s much faster. But notice that its predictions are quite bad.
causal_preload("sshleifer/tiny-gpt2")
#> Preloading causal model sshleifer/tiny-gpt2...
tic()
causal_next_tokens_pred_tbl("The apple doesn't fall far from the",
model = "sshleifer/tiny-gpt2"
)
#> Processing using causal model 'sshleifer/tiny-gpt2/' ...
#> # A tidytable: 50,257 × 2
#> token pred
#> <chr> <dbl>
#> 1 Ġstairs -10.7
#> 2 Ġvendors -10.7
#> 3 Ġintermittent -10.7
#> 4 Ġhauled -10.7
#> 5 ĠBrew -10.7
#> 6 Rocket -10.7
#> 7 dit -10.7
#> 8 ĠHabit -10.7
#> 9 ĠJr -10.7
#> 10 ĠRh -10.7
#> # ℹ 50,247 more rows
toc()
#> 0.087 sec elapsed
All in all, the package pangoling would be most useful in a situation like the following (see also the worked-out example vignette). Given a (toy) dataset where sentences are organized with one word or short phrase in each row:
sentences <- c(
  "The apple doesn't fall far from the tree.",
  "Don't judge a book by its cover."
)
df_sent <- strsplit(x = sentences, split = " ") |>
  map_dfr(.f = ~ data.frame(word = .x), .id = "sent_n")
df_sent
#> # A tidytable: 15 × 2
#> sent_n word
#> <int> <chr>
#> 1 1 The
#> 2 1 apple
#> 3 1 doesn't
#> 4 1 fall
#> 5 1 far
#> 6 1 from
#> 7 1 the
#> 8 1 tree.
#> 9 2 Don't
#> 10 2 judge
#> 11 2 a
#> 12 2 book
#> 13 2 by
#> 14 2 its
#> 15 2 cover.
One can get the natural log-transformed probability of each word based on GPT-2 as follows:
df_sent <- df_sent |>
  mutate(lp = causal_words_pred(word, by = sent_n))
#> Processing using causal model 'gpt2/' ...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `Don't judge a book by its cover.`
#> ***
df_sent
#> # A tidytable: 15 × 3
#> sent_n word lp
#> <int> <chr> <dbl>
#> 1 1 The NA
#> 2 1 apple -10.9
#> 3 1 doesn't -5.50
#> 4 1 fall -3.60
#> 5 1 far -2.91
#> 6 1 from -0.745
#> 7 1 the -0.207
#> 8 1 tree. -1.58
#> 9 2 Don't NA
#> 10 2 judge -6.27
#> 11 2 a -2.33
#> 12 2 book -1.97
#> 13 2 by -0.409
#> 14 2 its -0.257
#> 15 2 cover. -1.38
Notice that by is inside the causal_words_pred() function. It’s also possible to use by in the mutate() call, or group_by(), but it will be slower.
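As a rough sketch of that slower alternative (assuming that causal_words_pred() picks up the grouping when by is omitted, which is not verified here):
df_sent |>
  group_by(sent_n) |>
  mutate(lp = causal_words_pred(word)) # grouping assumed to replace `by`; not verified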
The attentive reader might have noticed that the log-probability of “tree” here is not the same as the one presented before. This is because the actual word is " tree." (notice the space), which contains two tokens:
tokenize_lst(" tree.")
#> [[1]]
#> [1] "Ġtree" "."
The log-probability of " tree." is the sum of the log-probability of " tree" given its context and the log-probability of "." given its context.
We can verify this in the following way.
df_token_lp <- causal_tokens_pred_lst(
  "The apple doesn't fall far from the tree.") |>
  # convert the list into a data frame
  map_dfr(~ data.frame(token = names(.x), pred = .x))
#> Processing using causal model 'gpt2/' ...
#> Processing a batch of size 1 with 10 tokens.
df_token_lp
#> # A tidytable: 10 × 2
#> token pred
#> <chr> <dbl>
#> 1 The NA
#> 2 Ġapple -10.9
#> 3 Ġdoesn -5.50
#> 4 't -0.000828
#> 5 Ġfall -3.60
#> 6 Ġfar -2.91
#> 7 Ġfrom -0.745
#> 8 Ġthe -0.207
#> 9 Ġtree -0.281
#> 10 . -1.30
(tree_lp <- df_token_lp |>
  # requires a Ġ because there is a space before
  filter(token == "Ġtree") |>
  pull())
#> [1] -0.2808041
(dot_lp <- df_token_lp |>
  # doesn't require a Ġ because there is no space before
  filter(token == ".") |>
  pull())
#> [1] -1.300937
tree._lp <- df_sent |>
  filter(word == "tree.") |>
  pull()
# Test whether it is equal
all.equal(
  tree_lp + dot_lp,
  tree._lp
)
#> [1] TRUE
In a scenario like the one above, where one has a word-by-word text and wants to know the log-probability of each word, one doesn’t have to worry about the encoding or the tokens, since the function causal_words_pred() takes care of it.