Additional Special Tokens in GPT-4

The GPT-4 tokenizer (cl100k_base) introduces a few more special tokens. You can inspect them with the tiktoken library:

from tiktoken import get_encoding

enc = get_encoding("cl100k_base")
# special_tokens_set returns a set, so sort it for deterministic output
print(sorted(enc.special_tokens_set))

This prints:

['<|endofprompt|>', '<|endoftext|>', '<|fim_middle|>', '<|fim_prefix|>', '<|fim_suffix|>']

In addition to the <|endoftext|> token, we now have:

  • <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|>: These tokens support "fill-in-the-middle" (FIM), a technique that lets the model fill in a missing span of text rather than only continuing from the end. The details are beyond the scope of this tutorial, but you can read more in the FIM paper ("Efficient Training of Language Models to Fill in the Middle", Bavarian et al., 2022); a sketch of how such a prompt is assembled follows this list.

  • <|endofprompt|>: This token is used to separate the user's prompt from the model's response. This is useful for controlling the model's behavior in conversational or question-answering settings.
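
As a rough illustration, here is how a FIM prompt could be assembled in the PSM (prefix-suffix-middle) layout from the FIM paper. The exact prompt format GPT-4 was trained with is not documented here, so treat this as a sketch rather than a guaranteed recipe:

from tiktoken import get_encoding

enc = get_encoding("cl100k_base")

# Code with a gap: we provide everything before and after the missing span,
# and the model is asked to generate the span itself.
prefix = "def celsius_to_fahrenheit(c):\n    return "
suffix = "\n\nprint(celsius_to_fahrenheit(100))"

# PSM layout: the trailing <|fim_middle|> cues the model to emit the
# missing middle as its continuation.
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# encode() rejects special tokens in the input unless explicitly allowed.
ids = enc.encode(prompt, allowed_special="all")
print(ids[0])  # 100258, the single token id of <|fim_prefix|>

Note that the model sees the suffix before it writes the middle, which is what allows the completion to join up cleanly with the code that follows the gap.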

These special tokens are handled a bit differently from the regular tokens. They are not part of the learned BPE vocabulary; instead, they are appended to the vocabulary after the BPE merges have been trained, and at encoding time they are matched as exact strings before the BPE algorithm runs. This means they always map to a single token, even if their substrings occur frequently in the training data.
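
You can see this single-token behavior directly. Note that by default tiktoken refuses to encode special-token strings that appear in ordinary input, precisely so that user-supplied text cannot inject them:

from tiktoken import get_encoding

enc = get_encoding("cl100k_base")

# When explicitly allowed, the whole string maps to exactly one token id.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
# -> [100257]

# encode_ordinary() ignores special tokens and applies plain BPE, so the
# same string splits into several ordinary tokens.
print(enc.encode_ordinary("<|endoftext|>"))

# By default, encode() raises ValueError on special-token strings:
# enc.encode("<|endoftext|>")  # ValueError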

