These functions calculate the predictability of words, phrases, or tokens using a causal transformer model.

Usage

causal_words_pred(
  x,
  by = rep(1, length(x)),
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

causal_tokens_pred_lst(
  texts,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1
)

causal_targets_pred(
  targets,
  contexts = NULL,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

Arguments

x

A character vector of words, phrases, or texts to evaluate.

by

A grouping variable indicating how the elements of x are split into groups (e.g., sentences or paragraphs).

sep

A string specifying how words are separated within contexts or groups. Default is " ". For languages that don't have spaces between words (e.g., Chinese), set sep = "".
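For example, a minimal sketch for a language written without spaces (the pre-segmented words are illustrative, and a dedicated Chinese causal model would fit better than the default "gpt2"):

causal_words_pred(
  x = c("今天", "天气", "很好"), # illustrative pre-segmented words
  by = rep(1, 3),
  sep = "",
  model = "gpt2" # assumption: use a Chinese causal model in practice
)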

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2, since the log base 1/2 of p equals -log2(p).
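For example (a minimal sketch reusing the context-target pair from the Examples section below):

targets  <- "tree."
contexts <- "The apple doesn't fall far from the"
causal_targets_pred(targets, contexts)                # log_e(p), the default
causal_targets_pred(targets, contexts, log.p = FALSE) # raw probability p
causal_targets_pred(targets, contexts, log.p = 1/2)   # -log2(p), surprisal in bits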

ignore_regex

A regular expression describing tokens to ignore when calculating the log probabilities. For example, "^[[:punct:]]$" will ignore all punctuation that stands alone in a token.
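For example, a minimal sketch in which the stand-alone "." token inside "tree." is ignored, so the value reported for "tree." reflects only the "tree" token:

causal_words_pred(
  x = c("The", "apple", "doesn't", "fall", "far", "from", "the", "tree."),
  by = rep(1, 8),
  ignore_regex = "^[[:punct:]]$",
  model = "gpt2"
)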

model

Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website for available models.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

batch_size

Maximum size of the batch. Larger batches speed up processing but take more memory.

...

Currently not in use.

texts

A vector or list of sentences or paragraphs.

targets

A character vector of target words or phrases.

contexts

A character vector of contexts corresponding to each target.

Value

For causal_targets_pred() and causal_words_pred(), a named numeric vector of predictability scores. For causal_tokens_pred_lst(), a list of named numeric vectors, one for each sentence or group.

Details

These functions calculate the predictability (by default, the natural logarithm of the word probability) of words, phrases, or tokens using a causal transformer model:

  • causal_targets_pred(): Evaluates specific target words or phrases based on their given contexts. Use when you have explicit context-target pairs to evaluate, with each target word or phrase paired with a single preceding context.

  • causal_words_pred(): Computes predictability for all elements of a vector grouped by a specified variable. Use when working with words or phrases split into groups, such as sentences or paragraphs, where predictability is computed for every word or phrase in each group.

  • causal_tokens_pred_lst(): Computes the predictability of each token in a sentence (or group of sentences) and returns a list of results for each sentence. Use when you want to calculate the predictability of every token in one or more sentences.

See the online article on the pangoling website for more examples.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder-only model) is a type of large language model, usually used for text generation, that predicts the next word (more precisely, the next token) based on the preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").
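For instance, the default can be inspected, changed, and restored as follows ("distilgpt2" is just an example model name):

getOption("pangoling.causal.default")                   # "gpt2" unless changed
old <- options(pangoling.causal.default = "distilgpt2") # set a new default
options(old)                                            # restore the previous default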

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
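As a sketch, these arguments are named lists forwarded to from_pretrained(); add_prefix_space is an argument of the GPT-2 tokenizer in transformers (an assumption to verify for other models):

causal_targets_pred(
  targets  = "tree.",
  contexts = "The apple doesn't fall far from the",
  model = "gpt2",
  config_tokenizer = list(add_prefix_space = TRUE) # assumed GPT-2 tokenizer option
)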

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See also

Other causal model functions: causal_next_tokens_pred_tbl(), causal_pred_mats()

Examples

# Using causal_targets_pred
causal_targets_pred(
  targets = c("tree.", "cover."),
  contexts = c("The apple doesn't fall far from the",
               "Don't judge a book by its"),
  model = "gpt2"
)
#> Processing using causal model 'gpt2/' ...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `Don't judge a book by its cover.`
#> ***
#>     tree.    cover. 
#> -1.581741 -1.377739 

# Using causal_words_pred
causal_words_pred(
  x = df_sent$word,
  by = df_sent$sent_n,
  model = "gpt2"
)
#> Processing using causal model 'gpt2/' ...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `Don't judge a book by its cover.`
#> ***
#>         The       apple     doesn't        fall         far        from 
#>          NA -10.9004850  -5.4999222  -3.5977628  -2.9119270  -0.7454857 
#>         the       tree.       Don't       judge           a        book 
#>  -0.2066502  -1.5817409          NA  -6.2653966  -2.3259120  -1.9679886 
#>          by         its      cover. 
#>  -0.4091438  -0.2572804  -1.3777395 

# Using causal_tokens_pred_lst
preds <- causal_tokens_pred_lst(
  texts = c("The apple doesn't fall far from the tree.",
            "Don't judge a book by its cover."),
  model = "gpt2"
)
#> Processing using causal model 'gpt2/' ...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
preds
#> [[1]]
#>           The        Ġapple        Ġdoesn            't         Ġfall 
#>            NA -1.090049e+01 -5.499094e+00 -8.281615e-04 -3.597763e+00 
#>          Ġfar         Ġfrom          Ġthe         Ġtree             . 
#> -2.911927e+00 -7.454857e-01 -2.066502e-01 -2.808041e-01 -1.300937e+00 
#> 
#> [[2]]
#>         Don          't      Ġjudge          Ġa       Ġbook         Ġby 
#>          NA -2.58639312 -6.26539660 -2.32591200 -1.96798861 -0.40914381 
#>        Ġits      Ġcover           . 
#> -0.25728044 -0.02360982 -1.35412967 
#> 

# Convert the output to a tidy table
suppressPackageStartupMessages(library(tidytable))
map2_dfr(preds, seq_along(preds),
         ~ data.frame(tokens = names(.x), pred = .x, id = .y))
#> # A tidytable: 19 × 3
#>    tokens       pred    id
#>    <chr>       <dbl> <int>
#>  1 The     NA            1
#>  2 Ġapple -10.9          1
#>  3 Ġdoesn  -5.50         1
#>  4 't      -0.000828     1
#>  5 Ġfall   -3.60         1
#>  6 Ġfar    -2.91         1
#>  7 Ġfrom   -0.745        1
#>  8 Ġthe    -0.207        1
#>  9 Ġtree   -0.281        1
#> 10 .       -1.30         1
#> 11 Don     NA            2
#> 12 't      -2.59         2
#> 13 Ġjudge  -6.27         2
#> 14 Ġa      -2.33         2
#> 15 Ġbook   -1.97         2
#> 16 Ġby     -0.409        2
#> 17 Ġits    -0.257        2
#> 18 Ġcover  -0.0236       2
#> 19 .       -1.35         2