Get the log probability of each element of a vector of words (or phrases) using a causal transformer
Source:R/tr_causal.R
causal_lp.Rd
Get the log probability of each element of a vector of words (or phrases) using a causal transformer model. See the online article in pangoling website for more examples.
Arguments
- x
Vector of words, phrases or texts.
- by
Vector that indicates how the text should be split.
- l_contexts
Left context for each word in x. If l_contexts is used, by is ignored. Set by = NULL to avoid a message notifying that.
- ignore_regex
Characters matching this regular expression are ignored when calculating the log probabilities. For example, ^[[:punct:]]$ will ignore all punctuation that stands alone in a token.
- model
Name of a pre-trained model or folder.
- checkpoint
Folder of a checkpoint.
- add_special_tokens
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
- config_model
List with other arguments that control how the model from Hugging Face is accessed.
- config_tokenizer
List with other arguments that control how the tokenizer from Hugging Face is accessed.
- batch_size
Maximum size of the batch. Larger batches speed up processing but take more memory.
- ...
Currently not in use.
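As a sketch of how the by argument splits the input, the call below (an illustrative two-sentence example; running it downloads the "gpt2" model) computes log probabilities sentence by sentence, so the context resets between groups:

```r
library(pangoling)
# Two sentences, one word per element; `by` indicates the grouping:
words <- c("The", "dog", "barks.", "The", "cat", "meows.")
sent <- rep(c(1, 2), each = 3)
# Log probabilities are computed within each sentence; "The" in the
# second sentence is not conditioned on the first sentence.
causal_lp(x = words, by = sent, model = "gpt2")
```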
Details
A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (more accurately, the next token) based on the preceding context.
If not specified, the causal model used is the one set in the global option pangoling.causal.default; this can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default, use options(pangoling.causal.default = "newcausalmodel").
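For instance, the default model can be inspected and switched for the current session like this (the alternative model name below is only illustrative):

```r
library(pangoling)
# Check which causal model is currently the default ("gpt2" unless changed):
getOption("pangoling.causal.default")
# Switch the session default to another Hugging Face model
# (any valid causal model name would work here):
options(pangoling.causal.default = "distilgpt2")
```

Subsequent calls to causal_lp() without a model argument will then use the new default.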
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it is possible to control how the model and tokenizer are accessed from Hugging Face; see the Python method from_pretrained for details.
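A minimal sketch of passing such options through: the list elements are forwarded as named arguments to Python's from_pretrained(), so their names must be valid from_pretrained() parameters (revision is one such parameter; the value below is illustrative):

```r
library(pangoling)
# Pin the model and tokenizer to a specific revision of the
# Hugging Face repository via from_pretrained() arguments:
causal_lp(
  x = c("The", "apple", "falls."),
  model = "gpt2",
  config_model = list(revision = "main"),
  config_tokenizer = list(revision = "main")
)
```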
If errors occur when running a new model, check the status of https://status.huggingface.co/.
More examples
See the online article in pangoling website for more examples.
See also
Other causal model functions:
causal_config()
,
causal_lp_mats()
,
causal_next_tokens_tbl()
,
causal_preload()
,
causal_tokens_lp_tbl()
Examples
if (FALSE) { # interactive()
causal_lp(
x = c("The", "apple", "doesn't", "fall", "far", "from", "the", "tree."),
model = "gpt2"
)
causal_lp(
x = "tree.",
l_contexts = "The apple doesn't fall far from the tree.",
  by = NULL, # it's ignored anyway
model = "gpt2"
)
}