Returns the (decoded) vocabulary of a model.

Usage

transformer_vocab(
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  decode = FALSE,
  config_tokenizer = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

add_special_tokens

Whether to include special tokens. If NULL, it uses the same default as the AutoTokenizer method in Python.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics (see Examples). Default is FALSE.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed (see Examples).

Value

A character vector with the vocabulary of a model.

See also

Other token-related functions: ntokens(), tokenize_lst()

Examples

transformer_vocab(model = "gpt2") |>
 head()
#> [1] "!"  "\"" "#"  "$"  "%"  "&"
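
# A minimal sketch: with decode = TRUE the raw byte-level tokens are decoded
# into human-readable strings (output omitted here, since it depends on the
# downloaded tokenizer).
transformer_vocab(model = "gpt2", decode = TRUE) |>
 head()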
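
# A hedged sketch: extra tokenizer options can be passed as a named list via
# config_tokenizer; `add_prefix_space` is only an illustrative (assumed) option
# accepted by the GPT-2 tokenizer in Hugging Face.
transformer_vocab(
  model = "gpt2",
  add_special_tokens = TRUE,
  config_tokenizer = list(add_prefix_space = TRUE)
) |>
 head()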