Walkthrough of the GPT-2 and GPT-4 regex patterns
Let's look at a Python code example:
example = """
for i in range(1, 101):
if i % 3 == 0 and i % 5 == 0:
print("FizzBuzz")
elif i % 3 == 0:
print("Fizz")
elif i % 5 == 0:
print("Buzz")
else:
print(i)
"""
print(re.findall(gpt2pat, example))
This prints:
['\n', 'for', ' i', ' in', ' range', '(', '1', ',', ' 101', '):', '\n   ', ' if', ' i', ' %', ' 3', ' ==', ' 0', ' and', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'FizzBuzz', '")', '\n   ', ' elif', ' i', ' %', ' 3', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Fizz', '")', '\n   ', ' elif', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Buzz', '")', '\n   ', ' else', ':', '\n       ', ' print', '(', 'i', ')', '\n']
As you can see, the code has been split into a large number of chunks: keywords, variables, operators, and numbers each land in their own piece, usually carrying one leading space (' if', ' 3', ' =='). Indentation is handled by the \s+(?!\S) branch of the pattern, which splits a whitespace run so that its last space stays attached to the following word; that is why the indentation shows up as separate tokens like '\n   ' and '\n       ' rather than being glued onto the code.
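You can see that whitespace rule in isolation with a minimal check, reusing the gpt2pat compiled above:

print(re.findall(gpt2pat, "    hello"))
# ['   ', ' hello'] — the lookahead surrenders the final space so it can prefix the word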
This tokenization allows the model to understand the structure of the code better. It can learn, for example, that the keyword 'if' is often followed by a condition, then a colon, then an indented block. If all these elements were merged together into arbitrary blobs, that structure would be much harder to learn.
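One consequence worth making explicit: because BPE merges never cross these chunk boundaries, identical pieces of code always yield identical chunks, which keeps the statistics BPE learns from clean. A quick check on the example above:

chunks = re.findall(gpt2pat, example)
print(chunks.count(' print'))  # 4 — every call site produces the exact same chunk ' print'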
In the GPT-4 tokenizer, the regex pattern was updated with some additional rules, such as matching the contraction suffixes ('s, 'd, 'll, and so on) case-insensitively, limiting number merges to at most 3 digits, and handling whitespace differently. You can find the updated pattern in the tiktoken library.
The cl100k_base encoding is the one used by GPT-4; if you look at its definition in tiktoken (the pat_str in tiktoken_ext/openai_public.py), you'll see the updated regex pattern.
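For reference, the exact string has been adjusted for performance across tiktoken releases, but one published form of it reads as follows. It compiles with the same regex module, so you can rerun the comparison on the example above:

# cl100k_base split pattern, as published in tiktoken's openai_public.py
gpt4pat = re.compile(r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""")

print(re.findall(gpt4pat, example))
print(re.findall(gpt4pat, "1234567"))  # ['123', '456', '7'] — numbers are chunked 3 digits at a time

Among the differences: (?i:'s|'t|...) matches the contractions case-insensitively (so 'S and 's behave the same), \p{N}{1,3} caps each number chunk at three digits, and the \s*[\r\n]+ branch splits newlines away from the indentation that follows, leaving runs of spaces together as a single chunk that the cl100k BPE can then represent as one token.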
That's a quick overview of how the GPT tokenizers use regex patterns to chunk text before the BPE algorithm is applied, so that merges never happen across those chunk boundaries. In the next section, we'll dive into the actual implementation of the GPT-2 tokenizer released by OpenAI.