Forced Splits Using Regex

In the GPT series of models, the tokenizer doesn't just blindly apply the BPE algorithm to the raw text. It first uses a regular expression (regex) to split the input into chunks, and BPE is then applied within each chunk independently. This prevents certain kinds of characters, such as letters and punctuation, from ever being merged into a single token.

Let's take a look at the regex pattern used in GPT-2:

import regex as re
gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

This looks quite complex, but let's break it down piece by piece:

  • 's|'t|'re|'ve|'m|'ll|'d: These are common English contraction suffixes. The | means "or", so this part of the pattern matches any one of them. (Note they are case-sensitive: 'S or 'RE would not match.)
  •  ?\p{L}+: Note the leading space in the pattern: this matches a single optional space followed by one or more Unicode letters. The \p{L} is a Unicode character class that matches any kind of letter from any language. The optional space is how a word keeps the single space that precedes it.
  •  ?\p{N}+: Similar to the previous one, but this matches numbers (\p{N} is the Unicode character class for numeric characters), again with an optional leading space.
  •  ?[^\s\p{L}\p{N}]+: This matches an optional space followed by one or more characters that are not whitespace, letters, or numbers. Essentially, this catches runs of punctuation and symbol characters.
  • \s+(?!\S): This matches one or more whitespace characters, using a negative lookahead assertion (?!\S), which requires that the match is not immediately followed by a non-whitespace character. In a run of spaces before a word, this consumes all but the last space, leaving that final space to attach to the following word via the  ?\p{L}+ branch. It also matches trailing whitespace at the end of the string.
  • \s+: Finally, this catches any whitespace the previous branch could not, for example a tab or newline immediately followed by text (there the lookahead branch fails because it cannot give up its only character).
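The whitespace handling is the subtlest part, so here is a small runnable sketch of it. The standard-library re module does not support Unicode property classes like \p{L}, so this uses an ASCII-only approximation ([A-Za-z] in place of \p{L}, [0-9] in place of \p{N}); on plain ASCII text it splits the same way as the real pattern.

```python
import re

# ASCII-only approximation of the GPT-2 split pattern (stdlib `re` has
# no \p{L} / \p{N} support, so character classes stand in for them).
ascii_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"""
)

# A run of 4 spaces before a word: \s+(?!\S) takes the first 3 spaces,
# and the last space attaches to the following word via " ?[A-Za-z]+".
print(ascii_pat.findall("    leading spaces"))
# → ['   ', ' leading', ' spaces']
```

Note that the whitespace is not discarded: every character of the input ends up in exactly one chunk, the pattern only decides where the boundaries fall.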

Let's see what happens when we apply this pattern to a string:

print(re.findall(gpt2pat, "Hello've world123 how's are you!!!?"))

This prints:

['Hello', "'ve", ' world', '123', ' how', "'s", ' are', ' you', '!!!?']

As you can see, the string has been split into tokens, with the contractions, words, numbers, and punctuation all being separated.

The effect of this is that BPE merges can only ever happen inside one of these chunks. A merge can never span a chunk boundary, so, for example, a punctuation character will never be merged with a letter: the regex guarantees they always land in separate chunks.
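The overall pipeline can be sketched as follows. The bpe function below is a hypothetical stand-in for the real merge loop (here it just splits a chunk into its raw UTF-8 bytes); the point is only the structure: BPE is invoked once per regex chunk, so no merge can ever combine characters from two different chunks.

```python
import re

# Same ASCII-only approximation of the GPT-2 split pattern as above
# (stdlib `re` does not support \p{L} / \p{N}).
ascii_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"""
)

def bpe(chunk):
    """Hypothetical stand-in for the real BPE merge loop: returns one
    'token' per byte of the chunk, with no merges applied."""
    return [bytes([b]) for b in chunk.encode("utf-8")]

def encode(text):
    # BPE runs on each chunk independently. A merge learned between,
    # say, 'g' and '!' could never fire here, because 'g' and '!' are
    # never in the same chunk to begin with.
    tokens = []
    for chunk in ascii_pat.findall(text):
        tokens.extend(bpe(chunk))
    return tokens

print(ascii_pat.findall("dog!"))  # 'dog' and '!' are separate chunks
print(encode("dog!"))
```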


Last update: 2024-08-21