End of Text Token in GPT-2

Let's start by looking at the special tokens used in GPT-2. If we look at the encoder object from the GPT-2 tokenizer:

import json

# encoder.json is the vocabulary file shipped with the GPT-2 model release
encoder = json.load(open('encoder.json', 'r'))
print(len(encoder))  # 50257
print(encoder['<|endoftext|>'])  # 50256

We see that the encoder has 50,257 entries: the 256 base byte tokens, the 50,000 learned BPE merges, and one extra special token. That extra token is <|endoftext|>, which has the ID 50256, the last index in the vocabulary.

This token is used to mark the end of a document in the training data. When the model is trained, each document in the training set is tokenized, and the <|endoftext|> token is appended to the end. This helps the model learn when to stop generating text.
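
To make that concrete, here is a minimal sketch of the preprocessing step. It uses the tiktoken library as a stand-in for the original encoder code, and the document strings are placeholders:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # 50256, the id of <|endoftext|>

documents = ["First training document.", "Second training document."]

# Tokenize each document and append <|endoftext|>, then concatenate
# everything into one flat stream of token ids for training.
token_stream = []
for doc in documents:
    token_stream.extend(enc.encode(doc))
    token_stream.append(eot)

print(token_stream[-1])  # 50256, marking a document boundary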

Note, however, that the model has to learn the meaning of this token from the training data. The token alone doesn't magically make the model stop; the patterns in the data teach it that this token usually marks the end of a coherent piece of text.
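
In fact, at inference time it's the code driving generation, not the model, that halts when token 50256 is emitted. Here is a minimal sketch of such a sampling loop, with a random stub standing in for a real model and sampler (sample_next_token is a hypothetical helper):

import random

EOT_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary

def sample_next_token(context):
    # Stand-in for running GPT-2 and sampling from its logits;
    # here we just draw a random token id from the vocabulary.
    return random.randrange(50257)

tokens = [464, 2068]          # some starting context of token ids
for _ in range(200):          # hard cap so generation always ends
    next_id = sample_next_token(tokens)
    if next_id == EOT_ID:     # the loop, not the model, stops here
        break
    tokens.append(next_id)

print(f"generated {len(tokens)} tokens")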


Last update: 2024-08-21