Tokenize a string or token ids.
Usage
tokenize_lst(
x,
decode = FALSE,
model = getOption("pangoling.causal.default"),
add_special_tokens = NULL,
config_tokenizer = NULL
)
Arguments
- x
Strings or token ids.
- decode
Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.
- model
Name of a pre-trained model or a path to a local folder. Models based on "gpt2" should work. See the Hugging Face website for available models.
- add_special_tokens
Logical. Whether to include special tokens (such as beginning- or end-of-sequence markers). Defaults to the same behavior as the AutoTokenizer class in Python.
- config_tokenizer
List with other arguments that control how the tokenizer from Hugging Face is accessed.
See also
Other token-related functions: ntokens(), transformer_vocab()
Examples
tokenize_lst(x = c("The apple doesn't fall far from the tree."),
model = "gpt2")
#> [[1]]
#> [1] "The" "Ġapple" "Ġdoesn" "'t" "Ġfall" "Ġfar" "Ġfrom" "Ġthe"
#> [9] "Ġtree" "."
#>
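With decode = TRUE, the tokens should be returned as plain human-readable strings rather than the raw GPT-2 vocabulary entries, in which the Ġ prefix marks a preceding space. A sketch, assuming the same "gpt2" model is available locally or downloadable:

tokenize_lst(x = "The apple doesn't fall far from the tree.",
             model = "gpt2",
             decode = TRUE)

Because x also accepts token ids, a list of ids (here, hypothetical ids for illustration) can likewise be passed and decoded back into strings:

tokenize_lst(x = list(c(464L, 17180L)),
             model = "gpt2",
             decode = TRUE)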