For each mask, indicated with [MASK], in a sentence, get the possible tokens and their predictability (by default the natural logarithm of the word probability) using a masked transformer.

Usage

masked_tokens_pred_tbl(
  masked_sentences,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

masked_sentences

Vector of sentences, each containing one or more masks indicated with [MASK].

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2. (See the sketch after this list of arguments for an illustration.)

model

Name of a pre-trained model or folder. One should be able to use models based on "bert"; see the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.
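
As an illustration of the log.p argument described above, the calls below request the same predictions on three different scales. This is a minimal sketch; the resulting pred values depend on the model weights, so no output is shown.

# Natural-log predictability (the default, log.p = TRUE)
masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.")

# Raw probabilities instead of log probabilities
masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.",
  log.p = FALSE
)

# Surprisal in bits, i.e., -log2(probability)
masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.",
  log.p = 1 / 2
)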

Value

A table with the masked sentences (masked_sentence), the candidate tokens (token), their predictability (pred), and the respective mask number (mask_n).
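
Because pred is a natural log-probability by default, exp() recovers the underlying probability. A minimal sketch (df is just a hypothetical name for the stored result; the token "tree" appears in the example output below):

df <- masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.")
# Probability (rather than log-probability) of "tree" filling the first mask
exp(df$pred[df$token == "tree" & df$mask_n == 1])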

Details

A masked language model (also called BERT-like, or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model used is the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").
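
For instance (a sketch; "distilbert-base-uncased" is only an illustrative model name):

# Check which masked model is currently the default
getOption("pangoling.masked.default")
# Switch the default to another BERT-like model
options(pangoling.masked.default = "distilbert-base-uncased")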

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/.
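
For example, named arguments accepted by from_pretrained can be passed as a list. The sketch below pins a specific model revision (revision is a from_pretrained argument; the value "main" is only illustrative):

masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.",
  model = "bert-base-uncased",
  config_model = list(revision = "main"),
  config_tokenizer = list(revision = "main")
)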

More examples

See the online article on the pangoling website for more examples.

See also

Other masked model functions: masked_targets_pred()

Examples

masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.",
  model = "bert-base-uncased"
)
#> Processing using masked model 'bert-base-uncased/' ...
#> # A tidytable: 30,522 × 4
#>    masked_sentence                            token   pred mask_n
#>    <chr>                                      <chr>  <dbl>  <int>
#>  1 The [MASK] doesn't fall far from the tree. snow   -3.12      1
#>  2 The [MASK] doesn't fall far from the tree. rock   -3.45      1
#>  3 The [MASK] doesn't fall far from the tree. stone  -3.77      1
#>  4 The [MASK] doesn't fall far from the tree. girl   -3.85      1
#>  5 The [MASK] doesn't fall far from the tree. sun    -3.99      1
#>  6 The [MASK] doesn't fall far from the tree. tree   -3.99      1
#>  7 The [MASK] doesn't fall far from the tree. branch -4.09      1
#>  8 The [MASK] doesn't fall far from the tree. body   -4.12      1
#>  9 The [MASK] doesn't fall far from the tree. light  -4.43      1
#> 10 The [MASK] doesn't fall far from the tree. water  -4.44      1
#> # ℹ 30,512 more rows
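
A sentence may contain more than one mask; in that case mask_n distinguishes the masks. A sketch (output omitted):

masked_tokens_pred_tbl(
  "The [MASK] doesn't fall far from the [MASK].",
  model = "bert-base-uncased"
)
# Rows for the first mask have mask_n == 1, rows for the second have mask_n == 2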