Walkthrough of training a SentencePiece tokenizer

Training a SentencePiece Model

To train a SentencePiece model, we first need some training data. Let's use the description of SentencePiece from its GitHub page:

import sentencepiece as spm

# Write the training data to a file
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.")

Now we can train a SentencePiece model on this data:

# Train the model
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="m",
    vocab_size=500,
    model_type="bpe",
    character_coverage=1.0,
    num_threads=8,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3
)

This will create two files: m.model and m.vocab. The .model file is the serialized model itself (the learned BPE merges plus the vocabulary and training settings), while the .vocab file is a human-readable listing of each piece and its score.
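
You can peek at the .vocab file directly to see what was learned; each line is a tab-separated piece and score (a quick, generic look at the file, not a SentencePiece API call):

# Print the first few entries of the human-readable vocab file
# (each line is "<piece>\t<score>")
with open("m.vocab", encoding="utf-8") as f:
    for _ in range(5):
        print(f.readline().rstrip("\n"))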

Let's break down the arguments we passed to train:

  • input: The path to the training data file.
  • model_prefix: The prefix for the output model files.
  • vocab_size: The total number of tokens to learn, including the special tokens. If the training data is too small to support the requested size, training aborts with an error reporting the largest achievable value.
  • model_type: The type of model to learn. Here we're using BPE, but SentencePiece also supports unigram (its default), char, and word models.
  • character_coverage: The fraction of characters in the training data that the vocabulary must cover. 1.0 means every character that appears in the input gets its own token; the default of 0.9995 trims extremely rare characters, which matters mainly for languages with very large character sets.
  • num_threads: The number of threads to use for training.
  • pad_id, unk_id, bos_id, eos_id: The IDs to use for special tokens (padding, unknown, beginning of sequence, end of sequence).

There are many more arguments you can pass to fine-tune the behavior of SentencePiece. Check the documentation for more details.
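
For a taste, here is the same training call with a few of the more commonly used extras. These are all real SentencePiece trainer options, but the values (and the m_extra prefix) are purely illustrative, not recommendations:

# The earlier training call, with a few additional commonly used options
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="m_extra",
    vocab_size=500,
    model_type="bpe",
    normalization_rule_name="identity",       # disable text normalization (default is "nmt_nfkc")
    user_defined_symbols=["<sep>", "<cls>"],  # tokens guaranteed a slot in the vocab
    split_digits=True,                        # never merge digits into multi-digit tokens
    input_sentence_size=200000,               # cap on the number of sentences sampled for training
    shuffle_input_sentence=True,              # shuffle when sampling
)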

Using the Trained Model

Once we have a trained model, we can use it to tokenize and detokenize text:

# Load the model
sp = spm.SentencePieceProcessor(model_file="m.model")

# Tokenize a string (lowercase, since our tiny training text contains no capital "T",
# and characters unseen in training map to the unknown token -- more on this below)
text = "this is a test."
print(sp.encode(text))  # a list of token IDs; the exact values depend on the trained model

# Detokenize a list of IDs (the round trip recovers the original string)
ids = sp.encode(text)
print(sp.decode(ids))  # "this is a test."
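
You can also ask encode for the string pieces instead of IDs by passing out_type=str (a standard option in the Python bindings). SentencePiece marks word boundaries with the ▁ (U+2581) metasymbol; the exact segmentation below is just an example and depends on the trained model:

# Tokenize into visible subword pieces rather than IDs
print(sp.encode(text, out_type=str))  # e.g. ['▁this', '▁is', '▁a', '▁t', 'est', '.']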

We can also inspect the vocabulary:

# Print the vocabulary
print(sp.vocab_size())  # 500
for i in range(sp.vocab_size()):
    print(f"{i}: {sp.id_to_piece(i)}")

This will print out each token in the vocabulary along with its ID.
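
Since we pinned the special token IDs during training, we can also confirm that the processor reports them back. The accessors below (pad_id() and friends) are standard methods on SentencePieceProcessor:

# Check that the special token IDs match what we configured
print(sp.pad_id(), sp.unk_id(), sp.bos_id(), sp.eos_id())  # 0 1 2 3
print(sp.id_to_piece(sp.unk_id()))  # "<unk>"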

Handling Unknown Characters

One key difference between SentencePiece and the GPT tokenizers is how they handle characters that weren't seen during training. The GPT tokenizers run BPE directly over UTF-8 bytes, so every possible string can be encoded and nothing is ever truly unknown. SentencePiece, on the other hand, operates on Unicode code points and reserves a special unknown token (usually <unk>) for anything it cannot represent.

You can control how many characters end up in the vocabulary with the character_coverage parameter during training. If you set character_coverage below 1.0, SentencePiece keeps only the most frequent characters from the training data and maps the rest to <unk>. This keeps the base character set small, but it means the model cannot output the characters that were dropped.
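
In fact, the unknown token also shows up for characters that never occurred in the training data at all. We can see this with the model we trained above (a minimal check; the Chinese characters below appear nowhere in train.txt):

# Characters never seen during training encode to the <unk> ID
ids = sp.encode("hello 你好")
print(ids)                 # the unseen characters come back as the unknown ID
print(sp.unk_id() in ids)  # True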

Setting character_coverage to 1.0, as we did, includes every character from the training data in the vocabulary (the actual default is 0.9995). Note, though, that coverage only applies to characters that appear in the training data: characters the model has never seen at all still map to <unk> at encoding time. If you need to represent truly arbitrary input, SentencePiece offers a byte_fallback training option that encodes unseen characters as sequences of raw-byte tokens instead.
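
Here is a minimal sketch of retraining with byte fallback enabled. byte_fallback is a real trainer flag that reserves 256 byte tokens (written like <0x41>) in the vocabulary; the m_bytes prefix is just illustrative:

# Retrain so that unseen characters fall back to raw bytes instead of <unk>
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="m_bytes",
    vocab_size=500,
    model_type="bpe",
    character_coverage=1.0,
    byte_fallback=True,  # adds <0x00> ... <0xFF> byte tokens to the vocab
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)

sp_bytes = spm.SentencePieceProcessor(model_file="m_bytes.model")
print(sp_bytes.encode("你"))  # three byte tokens (UTF-8 encodes 你 as 3 bytes), no <unk>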

SentencePiece in the Wild

As mentioned earlier, SentencePiece is used in many state-of-the-art models. For example, the Llama models released by Meta use a SentencePiece BPE model with a vocabulary size of 32,000 and byte_fallback enabled. Because a serialized SentencePiece model stores its own training settings, you can recover the exact parameters they used by inspecting the tokenizer.model file that ships with the Llama repository.

Similarly, the BERT models use WordPiece, a related subword algorithm from Google (a separate tool, not a SentencePiece variant). Like default SentencePiece, WordPiece has no byte-level fallback: any character outside its vocabulary becomes the special [UNK] token.

That covers the basics of SentencePiece! It's a powerful and flexible tokenizer that's worth being familiar with. In the next section, we'll talk about how to choose an appropriate vocabulary size for your model.

