When text is processed by a language model, the tokenizer converts the input text into a sequence of numerical IDs. Each token corresponds to a unique ID from the tokenizer's vocabulary. Here's how this process works:
{"Token": 1234, "ization": 5678, "is": 45, "a": 46, ".": 5}
[CLS], [SEP] for classification tasks).Example: "Tokenization is a foundational aspect."
- Tokens: ["Token", "ization", "is", "a", "foundational", "aspect", "."]
Mapping Tokens to IDs:
Each token is then looked up in the vocabulary and replaced by its ID. If a token does not exist in the vocabulary, a special token (e.g., [UNK] for unknown) is used instead.

Example mapping:
- "Token" → 1234
- "ization" → 5678
- "is" → 45
- "a" → 46
- "foundational" → 8910
- "aspect" → 1122
- "." → 5
So, the text:
"Tokenization is a foundational aspect."
Converts to:
[1234, 5678, 45, 46, 8910, 1122, 5]
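To make the lookup step concrete, here is a minimal sketch using the illustrative vocabulary and IDs from above. A real tokenizer's vocabulary has tens of thousands of entries, and the [UNK] ID of 100 is an assumption for this example:

```python
# Illustrative vocabulary with the example IDs from above.
vocab = {
    "Token": 1234, "ization": 5678, "is": 45, "a": 46,
    "foundational": 8910, "aspect": 1122, ".": 5,
    "[UNK]": 100,  # assumed ID for the unknown token
}

def tokens_to_ids(tokens):
    # Look each token up; fall back to the [UNK] ID when it is missing.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

tokens = ["Token", "ization", "is", "a", "foundational", "aspect", "."]
print(tokens_to_ids(tokens))  # [1234, 5678, 45, 46, 8910, 1122, 5]
```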
Special Tokens:
Some models also wrap the sequence in special tokens:

- [CLS] (Classification): Added at the beginning of the text.
- [SEP] (Separation): Used to denote the end of a sentence or segment.
- [PAD] (Padding): Used to make all input sequences the same length.

For instance, the sequence might become:
[101, 1234, 5678, 45, 46, 8910, 1122, 5, 102]
Where:
- [CLS] → 101
- [SEP] → 102
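The 101 and 102 values above are BERT's actual IDs for [CLS] and [SEP]. As a quick check (assuming the transformers library and the bert-base-uncased checkpoint are available), encoding a sentence adds them automatically; the inner IDs come from BERT's real vocabulary rather than the illustrative numbers above:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Tokenization is a foundational aspect.")

# encode() adds [CLS] and [SEP] automatically; in BERT's vocabulary
# their IDs really are 101 and 102.
print(ids[0], ids[-1])  # 101 102
print(tokenizer.convert_ids_to_tokens(ids))  # starts '[CLS]', ends '[SEP]'
```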
Efficient Vocabulary Size:
"unbelievable" might not exist in the vocabulary but can be split into:"un", "believe", "able".Handling Rare Words:
"schoginize" → ["sch", "ogi", "nize"].Input to the Neural Network:
Input to the Neural Network:
The token IDs are what the model actually consumes: each ID selects a row of an embedding matrix, e.g. [1234, 5678, 45, 46] → Embedding Matrix (see the sketch below).

Contextual Understanding:
The ID-to-embedding lookup itself is fixed, but the model's attention layers then build a representation of each token that depends on its surrounding context.
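As a sketch of the embedding lookup described under "Input to the Neural Network" above, here is a minimal PyTorch example; the vocabulary size and embedding dimension are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 50,000-entry vocabulary, 768-dimensional embeddings.
embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=768)

token_ids = torch.tensor([1234, 5678, 45, 46])  # illustrative IDs from above
vectors = embedding(token_ids)                  # each ID selects one row

print(vectors.shape)  # torch.Size([4, 768])
```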
Hugging Face Tokenizers:
```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "Tokenization is a foundational aspect."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)     # ['Token', 'ization', 'is', 'a', 'foundational', 'aspect', '.']
print(token_ids)  # [1234, 5678, 45, 46, 8910, 1122, 5]
```

(The printed tokens and IDs above are the illustrative values used throughout this section; the real GPT-2 tokenizer marks word-leading spaces with 'Ġ', e.g. 'Ġis', and uses different IDs.)
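In practice you can skip the two-step dance: calling the tokenizer directly, e.g. `tokenizer(text)`, returns the `input_ids` along with an attention mask in a single call.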