In Large Language Models, a token is the fundamental unit of text processing. When we input text like 'Hello, how are you today?', the model breaks it down into tokens through a process called tokenization. A token may be a whole word, a piece of a word, a punctuation mark, or even whitespace; each one becomes a unit the model can represent and process.
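As a concrete illustration, here is a small Python sketch using the tiktoken library (my own choice of tokenizer; the text above does not name a specific one) to split that sentence into tokens and their IDs.

```python
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models (illustrative choice).
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, how are you today?"
token_ids = enc.encode(text)

# Decode each ID back to its text piece to see where the boundaries fall.
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)   # a short list of integers, one per token
print(pieces)      # e.g. ['Hello', ',', ' how', ' are', ' you', ' today', '?']
```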
There are three main types of tokenization approaches. Word-level tokenization treats each complete word as a single token. Subword tokenization breaks words into meaningful parts like prefixes, roots, and suffixes. Character-level tokenization splits text into individual characters. Each approach trades off differently: word-level keeps sequences short but needs a huge vocabulary and struggles with unseen words, character-level can handle any text but produces very long sequences, and subword tokenization sits in between, which is why modern LLMs use it.
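To make the contrast concrete, here is a minimal sketch of the three approaches; the splitting rules and the hand-picked subword pieces are my own illustrative choices, not what any production tokenizer does.

```python
import re

text = "unbelievably fast"

# Word-level: one token per word (naive whitespace/punctuation split).
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)    # ['unbelievably', 'fast']

# Character-level: one token per character, including the space.
char_tokens = list(text)
print(char_tokens)    # ['u', 'n', 'b', ..., 't']

# Subword-level: pieces chosen by hand just for illustration; a real
# subword tokenizer learns its pieces from data.
subword_tokens = ["un", "believ", "ably", " fast"]
print(subword_tokens)
```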
Byte Pair Encoding, or BPE, is a widely used subword tokenization algorithm. It starts by treating each character as a separate token. Then it iteratively finds the most frequent pair of adjacent tokens and merges them into a single token. For example, with the word 'lower', we start with individual characters, then merge frequent pairs like 'e' and 'r' to form 'er', and 'o' and 'w' to form 'ow', eventually creating meaningful subword units.
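Below is a compact Python sketch of that BPE training loop. The toy word list and frequencies are made up for illustration, and the simple string replace is sufficient for this small example, though real implementations guard symbol boundaries more carefully.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    old = " ".join(pair)
    new = "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: each word pre-split into characters, with made-up frequencies.
vocab = {"l o w e r": 2, "l o w e s t": 3, "n e w e r": 6, "w i d e r": 3}

for step in range(4):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)          # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
```

Running this prints one merge per step, showing how frequent character pairs grow into subword units exactly as described above.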
Once tokens are created, each one is assigned a unique numerical identifier called a token ID. These IDs are then mapped to high-dimensional vectors called embeddings, typically with hundreds to thousands of dimensions depending on the model. No single dimension has a fixed meaning on its own; together they encode the token's meaning and usage, allowing the model to capture semantic relationships and context between different tokens.
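Here is a minimal sketch of that lookup step using PyTorch. The vocabulary size and embedding width are borrowed from GPT-2-scale models purely as illustrative assumptions, the token IDs are made up, and the embedding weights are random rather than trained.

```python
import torch

vocab_size = 50257     # assumed GPT-2-style vocabulary size
embedding_dim = 768    # assumed embedding width for a small model

# The embedding table is just a big lookup matrix: one row per token ID.
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

# A few made-up token IDs standing in for a tokenized sentence.
token_ids = torch.tensor([15496, 11, 703, 389, 345])

vectors = embedding(token_ids)
print(vectors.shape)   # torch.Size([5, 768]): one 768-dim vector per token
```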
Every LLM has a context window - a maximum number of tokens it can process at once. This limit varies by model: the original GPT-3 handled about 2,000 tokens, GPT-4 launched with 8,000- and 32,000-token variants, and newer models like Claude accept 100,000 tokens or more. When input exceeds this limit, applications typically truncate the oldest tokens, which can cause the model to lose track of earlier context.
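A common way to handle this in a chat application is to drop the oldest messages until the conversation fits. The sketch below counts tokens with tiktoken and uses a 4,096-token budget as an example limit; the budget and the message list are assumptions for illustration.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 4096   # example budget; real limits depend on the model

def fit_to_context(messages, max_tokens=MAX_TOKENS):
    """Drop the oldest messages until the total token count fits the window."""
    kept = list(messages)
    while kept and sum(len(enc.encode(m)) for m in kept) > max_tokens:
        kept.pop(0)   # discard the oldest message first
    return kept

history = ["first message ...", "second message ...", "latest question?"]
print(fit_to_context(history))
```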