Using a GPT2 transformer model to get word predictability
Source: vignettes/articles/intro-gpt2.Rmd
Transformer models are a type of neural network architecture used for natural language processing tasks such as language translation and text generation. They were introduced in the Vaswani et al. (2017) paper “Attention Is All You Need”.
Large Language Models (LLMs) are a specific type of pre-trained transformer model. These models have been trained on massive amounts of text data and can be fine-tuned to perform a variety of NLP tasks such as text classification, named entity recognition, and question answering.
A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation that can predict the next word (or, more accurately, the next token) based on a preceding context. GPT-2 (Generative Pre-trained Transformer 2), developed by OpenAI, is an example of a causal language model (see also Radford et al. 2019).
One interesting side effect of causal language models is that the (log) probability of a word given a certain context can be extracted from them.
Load the following packages first:
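(The setup chunk is not reproduced here. Judging from the functions used in the rest of this article, it presumably loads pangoling itself together with tidytable and tictoc; treat the following as a sketch of that setup rather than the original chunk.)
library(pangoling) # causal_*() and tokenize_lst() functions
library(tidytable) # mutate(), filter(), map_dfr(), etc.
library(tictoc)    # tic() and toc() for timing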
Then let’s examine which continuation GPT-2 predicts following a specific context. Hugging Face provides access to pre-trained models, including freely available versions of GPT-2 in different sizes. The function causal_next_tokens_pred_tbl() will, by default, use the smallest version of GPT-2, but this can be modified with the argument model.
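For example, a larger GPT-2 variant can be requested by name. (This is only a sketch: "gpt2-medium" is assumed to be the corresponding Hugging Face identifier, and larger models take noticeably longer to download and run.)
causal_next_tokens_pred_tbl("The apple doesn't fall far from the",
  model = "gpt2-medium" # assumed identifier of the medium-sized GPT-2
)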
Let’s see what GPT-2 predicts following “The apple doesn’t fall far from the”.
tic()
(df_pred <- causal_next_tokens_pred_tbl("The apple doesn't fall far from the"))
#> Processing using causal model 'gpt2/' ...
#> # A tidytable: 50,257 × 2
#> token pred
#> <chr> <dbl>
#> 1 Ġtree -0.281
#> 2 Ġtrees -3.60
#> 3 Ġapple -4.29
#> 4 Ġtable -4.50
#> 5 Ġhead -4.83
#> 6 Ġmark -4.86
#> 7 Ġcake -4.91
#> 8 Ġground -5.08
#> 9 Ġtruth -5.31
#> 10 Ġtop -5.36
#> # ℹ 50,247 more rows
toc()
#> 4.31 sec elapsed
(The pretrained models and tokenizers will be downloaded from https://huggingface.co/ the first time they are used.)
The most likely continuation is “tree”, which makes sense. The first time a model is run, it will download some files that will be available for subsequent runs. However, every time we start a new R session and run a model, it will take some time to load it into memory; later runs in the same session are much faster. We can also preload a model with causal_preload().
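For instance, the default model could be preloaded explicitly at the start of a session (a sketch, assuming "gpt2" is the identifier of the default model used above):
causal_preload("gpt2") # load the model into memory so later calls skip this step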
tic()
(df_pred <- causal_next_tokens_pred_tbl("The apple doesn't fall far from the"))
#> Processing using causal model 'gpt2/' ...
#> # A tidytable: 50,257 × 2
#> token pred
#> <chr> <dbl>
#> 1 Ġtree -0.281
#> 2 Ġtrees -3.60
#> 3 Ġapple -4.29
#> 4 Ġtable -4.50
#> 5 Ġhead -4.83
#> 6 Ġmark -4.86
#> 7 Ġcake -4.91
#> 8 Ġground -5.08
#> 9 Ġtruth -5.31
#> 10 Ġtop -5.36
#> # ℹ 50,247 more rows
toc()
#> 0.23 sec elapsed
Notice that the predicted tokens (that is, the way GPT-2 interprets words) start with Ġ; this indicates that they are not the first word of a sentence. In fact, this is the way GPT-2 interprets our context:
tokenize_lst("The apple doesn't fall far from the")
#> [[1]]
#> [1] "The" "Ġapple" "Ġdoesn" "'t" "Ġfall" "Ġfar" "Ġfrom" "Ġthe"
Also notice that the GPT-2 tokenizer treats initial tokens differently from tokens that follow a space. A space in a token is indicated with “Ġ”.
tokenize_lst("This is different from This")
#> [[1]]
#> [1] "This" "Ġis" "Ġdifferent" "Ġfrom" "ĠThis"
It’s also possible to decode the tokens to get “pure” text:
tokenize_lst("This is different from This", decode = TRUE)
#> [[1]]
#> [1] "This" " is" " different" " from" " This"
Going back to the initial example: because causal_next_tokens_pred_tbl() returns natural log-probabilities by default, if we exponentiate them and sum them, we should get 1:
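(The original check is not reproduced here; a minimal sketch of it, using the df_pred table computed above, would be the following.)
sum(exp(df_pred$pred)) # should be very close to, but not exactly, 1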
Because of approximation errors, this is not exactly one.
When doing tests, sshleifer/tiny-gpt2 is quite useful: since it’s a tiny model, it’s much faster. But notice that its predictions are quite bad.
causal_preload("sshleifer/tiny-gpt2")
#> Preloading causal model sshleifer/tiny-gpt2...
tic()
causal_next_tokens_pred_tbl("The apple doesn't fall far from the",
model = "sshleifer/tiny-gpt2"
)
#> Processing using causal model 'sshleifer/tiny-gpt2/' ...
#> # A tidytable: 50,257 × 2
#> token pred
#> <chr> <dbl>
#> 1 Ġstairs -10.7
#> 2 Ġvendors -10.7
#> 3 Ġintermittent -10.7
#> 4 Ġhauled -10.7
#> 5 ĠBrew -10.7
#> 6 Rocket -10.7
#> 7 dit -10.7
#> 8 ĠHabit -10.7
#> 9 ĠJr -10.7
#> 10 ĠRh -10.7
#> # ℹ 50,247 more rows
toc()
#> 0.087 sec elapsed
All in all, the package pangoling would be most useful in a situation like the following (see also the worked-out example vignette). Given a (toy) dataset where sentences are organized with one word or short phrase in each row:
sentences <- c(
  "The apple doesn't fall far from the tree.",
  "Don't judge a book by its cover."
)
df_sent <- strsplit(x = sentences, split = " ") |>
  map_dfr(.f = ~ data.frame(word = .x), .id = "sent_n")
df_sent
#> # A tidytable: 15 × 2
#> sent_n word
#> <int> <chr>
#> 1 1 The
#> 2 1 apple
#> 3 1 doesn't
#> 4 1 fall
#> 5 1 far
#> 6 1 from
#> 7 1 the
#> 8 1 tree.
#> 9 2 Don't
#> 10 2 judge
#> 11 2 a
#> 12 2 book
#> 13 2 by
#> 14 2 its
#> 15 2 cover.
One can get the natural log-transformed probability of each word based on GPT-2 as follows:
df_sent <- df_sent |>
  mutate(lp = causal_words_pred(word, by = sent_n))
#> Processing using causal model 'gpt2/' ...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `Don't judge a book by its cover.`
#> ***
df_sent
#> # A tidytable: 15 × 3
#> sent_n word lp
#> <int> <chr> <dbl>
#> 1 1 The NA
#> 2 1 apple -10.9
#> 3 1 doesn't -5.50
#> 4 1 fall -3.60
#> 5 1 far -2.91
#> 6 1 from -0.745
#> 7 1 the -0.207
#> 8 1 tree. -1.58
#> 9 2 Don't NA
#> 10 2 judge -6.27
#> 11 2 a -2.33
#> 12 2 book -1.97
#> 13 2 by -0.409
#> 14 2 its -0.257
#> 15 2 cover. -1.38
Notice that by is inside the causal_words_pred() function. It’s also possible to use by in the mutate() call, or group_by(), but it will be slower.
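As a rough sketch of that slower alternative (assuming that causal_words_pred() picks up the grouping when by is omitted, which is not verified here):
df_sent |>
  group_by(sent_n) |>
  mutate(lp = causal_words_pred(word)) # grouping assumed to replace `by`; not verified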
The attentive reader might have noticed that the log-probability of “tree” here is not the same as the one presented before. This is because the actual word is " tree." (notice the space), which contains two tokens:
tokenize_lst(" tree.")
#> [[1]]
#> [1] "Ġtree" "."
The log-probability of " tree." is the sum of the log-probability of " tree" given its context and the log-probability of "." given its context.
We can verify this in the following way.
df_token_lp <- causal_tokens_pred_lst(
  "The apple doesn't fall far from the tree.") |>
  # convert the list into a data frame
  map_dfr(~ data.frame(token = names(.x), pred = .x))
#> Processing using causal model 'gpt2/' ...
#> Processing a batch of size 1 with 10 tokens.
df_token_lp
#> # A tidytable: 10 × 2
#> token pred
#> <chr> <dbl>
#> 1 The NA
#> 2 Ġapple -10.9
#> 3 Ġdoesn -5.50
#> 4 't -0.000828
#> 5 Ġfall -3.60
#> 6 Ġfar -2.91
#> 7 Ġfrom -0.745
#> 8 Ġthe -0.207
#> 9 Ġtree -0.281
#> 10 . -1.30
(tree_lp <- df_token_lp |>
  # requires a Ġ because there is a space before
  filter(token == "Ġtree") |>
  pull())
#> [1] -0.2808041
(dot_lp <- df_token_lp |>
  # doesn't require a Ġ because there is no space before
  filter(token == ".") |>
  pull())
#> [1] -1.300937
tree._lp <- df_sent |>
  filter(word == "tree.") |>
  pull()
# Test whether it is equal
all.equal(
  tree_lp + dot_lp,
  tree._lp
)
#> [1] TRUE
In a scenario like the one above, where one has a word-by-word text and wants to know the log-probability of each word, one doesn’t have to worry about the encoding or the tokens, since the function causal_words_pred() takes care of it.