Large Language Models: Architecture and Mechanisms

1. LLM Pipeline (Pre-Inference → Inference → Post-Inference)

Overview: An LLM processes input text through several stages: (1) Pre-inference steps like data preprocessing, tokenization, and embedding lookup; (2) Inference within the Transformer network (multiple self-attention and feed-forward layers); and (3) Post-inference steps for decoding the model’s output into text. Essentially, raw text is converted to tokens, fed through the Transformer layers to produce probabilities for next tokens, and then decoded back into human-readable text. For example, BERT’s architecture explicitly breaks down into a Tokenizer (text → tokens), an Embedding layer (tokens → vectors), a Transformer Encoder stack, and an output layer that produces a probability distribution over the vocabulary (BERT (language model) - Wikipedia). Below is a step-by-step pipeline:

  1. Data Preprocessing & Tokenization: The input text may be normalized (e.g. lowercased or cleaned) as needed by the model. A tokenizer then splits the text into subword tokens and maps them to unique IDs in the model’s vocabulary (Decoding Strategies in Large Language Models). For example, "Transformers are great!" might be tokenized into ["Transform", "ers", " are", " great", "!"] each with an integer ID. Some architectures add special tokens at this stage (e.g. BERT prepends [CLS] and appends [SEP] tokens (BERT (language model) - Wikipedia)). The result is a sequence of token IDs representing the input text.

  2. Embedding Lookup & Positional Encoding: Each token ID is converted to a dense vector via an embedding matrix. The embedding matrix is essentially a large lookup table (size VocabSize × d_model); indexing it by a token ID returns that token’s vector representation (How does nn.Embedding work? - PyTorch Forums). Formally, if E is the embedding matrix and t_i is the i-th token ID, the embedding vector is $x_i = E[t_i]$. These token embeddings are then supplemented with positional encodings to inject word-order information (since self-attention is order-agnostic) (Attention is All you Need). Positional encodings can be fixed (e.g. sinusoidal patterns) or learned vectors of the same dimension, added to each token’s embedding. After this step, we have a sequence of input vectors $X = (x_1, x_2, ..., x_n)$, each of dimension $d_{model}$, representing the tokens together with their positions.

  3. Transformer Layers (Inference): The sequence of embeddings is then fed through the Transformer’s layers. Each Transformer layer consists of a Multi-Head Self-Attention sub-layer followed by a position-wise feed-forward sub-layer, with residual connections and layer normalization around each (BERT (language model) - Wikipedia). During self-attention, each token’s representation is updated by attending to all tokens in the sequence (for encoder layers, or decoder layers without a causal mask) or only to previous tokens (for decoder layers with causal masking). The self-attention mechanism produces contextualized representations for each position (see Section 3 for mathematical detail). Then the feed-forward network (FFN) (typically a 2-layer MLP with a nonlinearity like ReLU or GeLU) is applied to each position independently, further transforming the representation. Each sub-layer output is added to its input (residual connection) and normalized (LayerNorm) (transformer/models/layers/layer_norm.py at master · hyunwoongko/transformer · GitHub). Stacking many such layers (e.g. 12 layers in the smallest GPT-3 variant, 96 in the full 175B-parameter GPT-3) builds a deep Transformer. If the model has an encoder-decoder architecture (like T5), the input goes through the encoder stack (bidirectional attention), and the decoder stack generates output tokens autoregressively, using cross-attention to the encoder’s final outputs in addition to self-attention.

  4. Output Projection to Vocabulary: After the final Transformer layer, we obtain an output vector for each token position (or for the last generated token in autoregressive generation). These vectors are then projected back to the vocabulary space via a linear layer (often the transpose of the embedding matrix) to produce a score (logit) for each possible next token (BERT (language model) - Wikipedia). In a language generation setting (like GPT), typically we focus on the last position’s output logits, which represent the model’s predicted probability distribution for the next token given the input sequence. A softmax is applied to convert logits to probabilities. For example, after input "I have a dream", the model might assign 17% probability to the next token being " of" (Decoding Strategies in Large Language Models).

  5. Decoding (Post-Inference): Given the output probabilities, a decoding strategy is used to choose the next token (and subsequent tokens) when generating text. Decoding is not a learned part of the model but a procedural step. Strategies include:

    • Greedy decoding: pick the highest-probability token at each step (fast but can be short-sighted).
    • Beam search: explore multiple candidate sequences in parallel and choose the highest overall probability sequence (helps find more likely sentences, often used in translation) (Decoding Strategies in Large Language Models).
    • Random sampling: randomly sample the next token according to the probability distribution (used to generate varied or creative text).
    • Top-k or Nucleus sampling: sample only from the top k tokens, or from the smallest set of tokens whose cumulative probability mass reaches p (respectively), at each step to avoid sampling from the low-probability tail.

  The decoding process continues iteratively: each chosen token is appended to the input and fed back into the model (using the model’s autoregressive nature) to predict the following token, until an end-of-sequence token is produced or a length limit is reached.

  6. Output Post-Processing: The sequence of generated token IDs is finally converted back to human-readable text using the tokenizer’s vocabulary mapping. This involves merging subwords and symbols back into strings (the inverse of the tokenization step). Any special tokens (like end-of-sequence markers) are removed or converted appropriately. The final output is a text sequence produced by the LLM. For example, given the prompt "I have a dream", the generated tokens " of", " being", " a", " doctor", "." would be merged (together with the prompt) into "I have a dream of being a doctor." as the output sentence. A minimal end-to-end code sketch of this pipeline follows below.
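
To make the pipeline concrete, here is a minimal sketch of steps 1–6 using the Hugging Face Transformers library (the gpt2 checkpoint and single-step greedy decoding are illustrative choices, not prescribed by the text above):

```python
# Minimal end-to-end sketch: text -> tokens -> Transformer -> logits -> next token -> text.
# Assumes the Hugging Face `transformers` library and the small "gpt2" checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "I have a dream"
input_ids = tokenizer(text, return_tensors="pt").input_ids   # step 1: tokenize -> token IDs

with torch.no_grad():
    logits = model(input_ids).logits                          # steps 2-4: embed, Transformer layers, project to vocab
probs = torch.softmax(logits[0, -1], dim=-1)                  # probability distribution over the next token

next_id = int(torch.argmax(probs))                            # step 5: greedy decoding (one step)
print(tokenizer.decode([next_id]))                            # step 6: map the token ID back to text
```

In practice the last three lines would be repeated in a loop (or replaced by model.generate) to produce a full continuation.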

2. Comparison of Major Architectures (GPT, BERT, T5, Mixtral, etc.)

Large Language Models can have different architectural flavors despite all using the Transformer under the hood. Key differences include whether they are decoder-only or encoder-only, whether they use unidirectional or bidirectional attention, or even more advanced setups like Mixture-of-Experts. Below we compare several major LLM architectures in terms of cost, accuracy, efficiency, training and deployment.

GPT (Generative Pre-Trained Transformer, decoder-only autoregressive)

  • Type & Purpose: GPT models use a decoder-only Transformer. They are trained to predict the next token given all previous tokens (autoregressive language modeling); e.g. GPT-2, GPT-3, and GPT-4 are in this family (GPT-3 - Wikipedia). They excel at text generation and can be prompted to perform many tasks (few-shot learning) by formulating tasks as text completion.
  • Size & Cost: These models are extremely large. GPT-2 had 1.5 billion parameters (GPT-3 - Wikipedia), while GPT-3 has 175 billion parameters. Training GPT-3 from scratch required enormous compute – an estimated cost of ~$4.6 million USD on cloud GPUs (GPT-3 - Wikipedia). Newer GPT variants (GPT-3.5, GPT-4) are even larger or use specialized optimizations, further increasing training cost. Inference is also costly: the cost of each forward pass grows with model size and sequence length.
  • Accuracy & Capabilities: GPT models achieve high accuracy in generative tasks and can produce fluent, coherent passages of text. They set state-of-the-art in many language benchmarks (especially zero-shot or few-shot settings). However, being unidirectional, they don’t inherently see the full context on both sides of a token (no future context during training), so they are not directly optimized for understanding tasks like classification or span prediction (though they can be adapted with prompting or fine-tuning). GPT-3 and beyond demonstrate strong performance across QA, summarization, and dialogue after additional fine-tuning (e.g. instruction tuning, RLHF).
  • Computational Efficiency: Autoregressive generation means sequential inference – the model must generate tokens one by one, each time conditioning on all previous output. This can be slow for long outputs (O(n) forward passes to generate n tokens). However, the self-attention mechanism is highly parallelizable for computing next-token probabilities given the current context (all tokens of the context are processed in parallel per step) (Decoding Strategies in Large Language Models). GPT models use caching of past key/value vectors to avoid recomputing attention for prior tokens on each step (see Section 4), which greatly speeds up long-sequence generation. Still, the large number of layers and parameters makes GPT models memory- and compute-intensive.
  • Training Difficulty: Training is conceptually straightforward (next-word prediction on massive text data), but scaling to hundreds of billions of parameters is challenging. It requires distributed training across many GPUs/TPUs, careful optimization (gradient scaling, learning rate scheduling, etc.), and huge datasets. GPT models typically train on hundreds of billions of tokens of text. There is no need for labeled data (self-supervised), but ensuring training stability at scale is non-trivial. (OpenAI’s GPT-3 paper describes drawing on roughly 45 TB of raw Common Crawl text, heavily filtered before training, with training run on a dedicated GPU supercomputer.)
  • Deployment & Usage: Deployment of GPT models can be challenging due to their size – GPT-3 (175B) typically runs on specialized hardware or behind an API. Distillation or quantization can compress the model at some accuracy cost. That said, smaller GPT variants (e.g. GPT-2 or distilled versions, or open-source 7B-13B models like LLaMA) can be deployed on a single GPU with enough memory. GPT models are general-purpose; users often interact with them via prompting. However, controlling their output can be difficult (they may produce factually incorrect or sensitive content), so techniques like prompt engineering or fine-tuning with human feedback are used to align them.

BERT (Bidirectional Encoder Representations from Transformers, encoder-only)

  • Type & Purpose: BERT uses an encoder-only Transformer stack (BERT (language model) - Wikipedia). It is trained with a bidirectional self-attention approach: the model sees all words in a sentence (both left and right context) simultaneously. BERT is not a generative model; instead, it is pre-trained via masked language modeling (MLM) and next sentence prediction (NSP) objectives (BERT (language model) - Wikipedia). It’s primarily used to produce rich contextual embeddings for text, which can then be fine-tuned for downstream tasks like classification, extraction, etc. (e.g. question answering, sentiment analysis). A short masked-prediction sketch follows this list.
  • Size & Cost: BERT comes in manageable sizes: BERT_base has 110 million parameters and BERT_large has ~340 million (BERT (language model) - Wikipedia). These sizes are tiny compared to GPT-3. The training cost is much lower; Google reported pre-training BERT_large on 64 TPU chips for 4 days (Speeding up BERT - Grigory Sapunov - Medium) (BERT (language model) - Wikipedia). In cloud-compute terms, that’s on the order of a few thousand dollars (one estimate put BERT’s training cost around $6k total) (The Staggering Cost of Training SOTA AI Models | Synced). The training corpus was 3.3 billion words (BooksCorpus + English Wikipedia) (BERT (language model) - Wikipedia), which is large but far less than GPT’s text data. Because of BERT’s moderate size, inference is faster and less memory-intensive than GPT-style models, and it can often run on commodity GPUs or even CPUs for smaller versions.
  • Accuracy & Capabilities: BERT achieved state-of-the-art accuracy on many NLP tasks when it was introduced (GLUE benchmark, SQuAD Q&A, etc.), thanks to its bidirectional understanding of text (BERT (language model) - Wikipedia). By seeing an entire sentence, BERT can capture nuanced context (e.g. disambiguate word sense, resolve coreference) that left-to-right models might miss. However, BERT is not designed for text generation; it cannot naturally continue a sentence. Instead, it excels at tasks like predicting a masked word in a sentence, or producing a vector for the entire sentence (via the [CLS] token) for classification. BERT’s representations can be fine-tuned for specific tasks with relatively little data. Its accuracy on downstream tasks is high, though often later surpassed by larger or more specialized models.
  • Computational Efficiency: BERT processes the entire input sequence in one forward pass (no autoregressive decoding needed). This makes inference efficient for tasks like classification or token tagging – just feed in the sentence, and get an output. The self-attention in BERT is quadratic in sequence length (like all Transformers) – e.g. O(n^2) for sequence length n – but for reasonably short texts (a few hundred tokens), this is manageable. There’s no need to cache or generate step by step as in GPT. Training BERT can be done in parallel across the whole sequence (thanks to bidirectional attention, but that necessitates the special MLM training approach). One downside is that BERT must be fine-tuned separately for each task (in the original paradigm), meaning separate models for QA, classification, etc., whereas GPT-style models can often perform many tasks via prompting.
  • Training Difficulty: BERT’s training involves masked language modeling, where 15% of tokens are randomly masked and the model must predict them (BERT (language model) - Wikipedia), and a secondary task (NSP) to predict whether two sentences are consecutive in the original text (BERT (language model) - Wikipedia). These tasks require some data preprocessing (the masking strategy), and the NSP task was later found to be largely unnecessary (RoBERTa removed it). Training is stable at BERT’s scale; it also demonstrated the effectiveness of large batch sizes (256 sequences of 512 tokens, i.e. on the order of 128k tokens per batch) and long training schedules. Compared to GPT, BERT’s bidirectional training is trickier to implement (you can’t just do left-to-right generation; you need to create the masking labels), but once implemented, it converges quickly to rich textual understanding. Many variations (RoBERTa, ALBERT, etc.) streamlined these training aspects.
  • Deployment & Usage: BERT is widely deployed for NLP understanding tasks. Because it’s not generative, it’s used in applications like search (to understand queries), text classification, etc. Deployment is relatively easy: BERT_base can fit on a single GPU and even on mobile in a distilled form. There are many optimized versions (distilled BERT, quantized INT8 BERT) to run efficiently in production. One challenge is that for each task, a separate fine-tuned model might be needed (though frameworks like multi-task training or prompting have been explored to use a single model for multiple tasks). BERT models output fixed-size vector representations that are useful features for downstream systems. The fact that BERT is bidirectional also means it requires the full text upfront (it’s not suitable for streaming or incremental output scenarios).
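
To make the masked-language-modeling behavior described above concrete, here is a minimal sketch using the Hugging Face fill-mask pipeline (the bert-base-uncased checkpoint is an illustrative choice):

```python
# Minimal sketch: BERT predicting a masked word (assumes the Hugging Face `transformers` library).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    # Each candidate carries the predicted token string and its probability.
    print(candidate["token_str"], round(candidate["score"], 3))
```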

T5 (Text-To-Text Transfer Transformer, encoder-decoder sequence-to-sequence)

  • Type & Purpose: T5 uses an encoder–decoder Transformer architecture (like the original Transformer used for machine translation) (BERT (language model) - Wikipedia). It was designed with a “text-to-text” paradigm, meaning every task (translation, summarization, classification, QA, etc.) is cast as feeding some text input into the model and generating some text output. For example, for sentiment classification, the input might be "sst2 sentence: I love this movie." and the output: "positive". This unified approach allows a single T5 model to be fine-tuned on a variety of tasks using the same interface (a short usage sketch follows this list). During inference, the encoder processes the input text into a sequence of vectors, and the decoder generates the output text token-by-token (with cross-attention to the encoder output).
  • Size & Cost: T5 comes in multiple sizes; the largest original model, T5-11B, has 11 billion parameters (Understanding the T5 Model: A Comprehensive Guide - Medium). This is smaller than GPT-3 but still very large. Training T5-11B from scratch is expensive (on the order of millions of GPU-hours); the researchers used a massive dataset called C4 (~750GB of cleaned web text) and also pre-trained on multiple tasks. A published figure is that a smaller T5 (3B) took about 10^2 petaflop/s-days of compute to train, which implies the 11B model took even more (Costs to finetune T5 - Generative AI with Large Language Models). However, because T5 uses an encoder-decoder (which doubles the layers, but each half can be smaller than an equivalent decoder-only model for the same parameter count), and because 11B < 175B (GPT-3), the training cost, while high, is more attainable for academic or industry labs than GPT-3’s was. Inference cost depends on the task – e.g., generating a short summary might be quite fast, but generating a long paragraph or translation requires autoregressive decoding like GPT. T5-11B in particular is memory-heavy; it may require model parallelism to serve.
  • Accuracy & Capabilities: T5 achieved state-of-the-art on a variety of NLP tasks after fine-tuning, thanks to the flexibility of the text-to-text framework. It can handle generative tasks (summarization, translation) as well as classification/regression tasks (by outputting a token or string that represents a label or value). T5’s multi-task pre-training (and a variant FLAN-T5 with instruction tuning) gives it strong zero-shot and few-shot abilities as well. In terms of accuracy, T5 models are very competitive; a properly fine-tuned T5-large or T5-XXL often performs on par with similarly sized BERT/RoBERTa on GLUE or SQuAD, and T5 is superior in any task requiring text generation. One trade-off: the encoder-decoder structure means for very short inputs and outputs, one might be doing more work than necessary (encoding and decoding) compared to a single-pass model.
  • Computational Efficiency: Parallel encoding: T5’s encoder can process the input in parallel (like BERT) with bidirectional attention, which is efficient for absorbing the source text. Autoregressive decoding: The decoder side is like GPT, generating one token at a time with causal masked self-attention (plus attending to encoder outputs). So generation is sequential. The two-part architecture means at inference you do extra computations (the encoder pass) that a decoder-only model might not do if you feed it prompts directly. However, that encoder can be reused for multiple decodings if doing beam search. The overall compute per token might be higher than a pure decoder model of equal size because of cross-attention and the fact that encoder and decoder each have their own stack of layers. On the other hand, having a separate encoder allows longer inputs or more complex input processing without blowing up the decoding cost. T5, like other Transformers, has O(n^2) self-attention in each layer for sequences of length n. For very long inputs or outputs, this can be slow (T5 was originally trained with inputs up to 512 tokens, but newer models or variants can go higher).
  • Training Difficulty: T5’s training involved a mix of unsupervised span-masking (a variant of masked LM where contiguous spans are masked and replaced with a single mask token to be generated by the decoder) and supervised multi-task objectives. This “multi-task pre-training” is more complex to orchestrate than GPT’s single objective or BERT’s two objectives. It requires careful weighting of tasks, and the model has to learn to prepend task-specific prefixes to prompts (like "translate English to German:"). Ensuring the model doesn’t forget how to do the tasks or doesn’t focus too much on one task was a challenge addressed in the T5 work. From an optimization standpoint, training an encoder-decoder that’s this large means twice the layers of an encoder-only model, which demands more memory and sharding across devices. T5 used advanced optimizers and infrastructure (TPU pods). In summary, training is heavy but feasible with resources, and results in a very versatile model.
  • Deployment & Usage: T5 is often deployed in scenarios where generation is needed but with the ability to condition on input text (translation APIs, summarization tools, etc.). The text-to-text format means a single T5 model (especially after instruction fine-tuning, as in FLAN-T5) can handle many tasks via different prompts, which simplifies deployment (one model, many tasks). However, the largest T5 models (3B, 11B) are challenging to deploy due to memory and latency – they might require multiple GPUs and have high latency for long outputs. For real-time applications, smaller variants or distilled versions are used. The encoder-decoder structure also allows partial model reuse; for example, one could swap in a new decoder and fine-tune it for language generation in a different language while keeping the encoder, etc. Deployment ease thus varies: smaller T5 = easy, T5-XXL = quite difficult without a robust serving setup.
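
As a minimal illustration of the text-to-text interface described above, here is a sketch using Hugging Face Transformers (the t5-small checkpoint and the translation prefix are illustrative choices):

```python
# Minimal text-to-text sketch with T5 (assumes the Hugging Face `transformers` library).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The task is specified as a plain-text prefix; the encoder reads the input,
# and the decoder generates the answer token by token.
input_ids = tokenizer("translate English to German: The house is wonderful.",
                      return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```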

4.1 Self-Attention Computation

Below is simplified pseudocode for (single-head) scaled dot-product self-attention over a sequence of input vectors X, with an optional causal mask:

```python
def self_attention(X, MASKED=False):
    # X: (n, d_model) matrix of input vectors, one row per token
    Q = X @ W_Q                           # shape: (n, d_k)  queries
    K = X @ W_K                           # shape: (n, d_k)  keys
    V = X @ W_V                           # shape: (n, d_v)  values
    scores = (Q @ K.T) / sqrt(d_k)        # shape: (n, n), pairwise attention scores
    if MASKED:
        scores = apply_causal_mask(scores)  # set scores[i, j] = -inf for j > i, for each position i
    weights = softmax(scores, axis=1)     # shape: (n, n), softmax along each query's scores
    output = weights @ V                  # shape: (n, d_v)
    return output  # (in practice, output would be further projected by W_O in multi-head case)
```


Explanation: For each token (each row in Q), this finds an attention weight for every other token (the columns in that row of `scores`). The softmax turns those into a distribution. `output[i]` is essentially $\sum_j \alpha_{i,j} V_j$ as discussed. In a real multi-head attention, $d_k = d_v = d_{model}/h$ for $h$ heads, and we would do this $h$ times and combine. Also, there is typically a final linear layer $W^O$ so that the output is back to size $d_{model}$. The `MASKED` flag indicates whether we apply a causal mask (as in a GPT-style decoder) – this ensures token $i$ can only attend to positions $\le i$ (preventing cheating by looking ahead). In code, `apply_causal_mask` would set `scores[i,j] = -∞` for $j > i$ before softmax, so those weights become 0 ([Decoding Strategies in Large Language Models](https://huggingface.co/blog/mlabonne/decoding-strategies)).

The result of `self_attention(X)` is then typically added to the original $X$ (residual connection) and fed into layer normalization. Then it goes into the FFN sub-layer. The combination yields the output of one Transformer layer for sequence X.
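
To make that composition concrete, here is a minimal NumPy-style sketch of one (post-norm) Transformer layer built on the self_attention function above. It assumes a single head with d_v = d_model, omits LayerNorm’s learnable scale and shift, and the weight names W1, b1, W2, b2 are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_layer(X):
    # 1. Self-attention sub-layer, then residual connection and LayerNorm
    attn_out = self_attention(X)                 # (n, d_model), as sketched above
    X = layer_norm(X + attn_out)
    # 2. Position-wise feed-forward sub-layer (2-layer MLP), then residual and LayerNorm
    hidden = np.maximum(0.0, X @ W1 + b1)        # ReLU nonlinearity
    ffn_out = hidden @ W2 + b2
    X = layer_norm(X + ffn_out)
    return X                                     # (n, d_model): output of one Transformer layer
```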

4.2 Tokenization and Embedding Lookup

This pseudocode shows how raw text is turned into embeddings that go into the model (pre-inference steps):

```python
# Given an input string 'text' and a tokenizer (with vocab and encode/decode methods)
# and an embedding matrix 'E' of shape (vocab_size, d_model).
def tokenize_and_embed(text):
    # 1. Tokenization (text -> tokens -> ids)
    tokens = tokenizer.tokenize(text)            # e.g. ["Transform", "ers", " are", " great", "!"]
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    # 2. Append any special tokens if required (for example, for BERT:
    # token_ids = [CLS_id] + token_ids + [SEP_id])
    # 3. Embedding lookup (ids -> vectors)
    embeddings = [ ]
    for pos, tok_id in enumerate(token_ids):
        # Get the token's embedding vector
        word_vec = E[tok_id]            # shape: (d_model,)
        # Get positional encoding for this position (pos) if using static pos encodings
        pos_vec = positional_encoding(pos, d_model)  # shape: (d_model,)
        embeddings.append(word_vec + pos_vec)
    return embeddings   # List of d_model-dimensional vectors, length = number of tokens (incl. specials)
```

Explanation: We first tokenize the input text. The tokenizer could be based on Byte-Pair Encoding, WordPiece, or another subword algorithm. It splits text into a list of token strings and then maps each to an integer ID (using a vocabulary dictionary) (Decoding Strategies in Large Language Models). Then, we create the embedding for each token by indexing into the embedding matrix E. The embedding matrix is essentially a lookup table: each row E[i] is the learned vector for token with ID i (How does nn.Embedding work? - PyTorch Forums). We also add a positional encoding for each token’s index (here shown as a function positional_encoding(pos) – this could retrieve a precomputed sinusoidal vector or a learned pos embedding from another matrix). The result is a sequence of vectors that combine the token identity and position. These vectors are what we feed into the first Transformer layer.

(Note: In practice, modern Transformer implementations often incorporate the addition of positional encoding internally, and some use learned positional embeddings or relative position schemes. But the above illustrates the general process.)
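
For reference, here is one way the positional_encoding(pos, d_model) helper used above could be implemented with the fixed sinusoidal scheme of the original Transformer paper (a sketch; learned positional embeddings would instead index a second embedding matrix):

```python
import numpy as np

def positional_encoding(pos, d_model):
    # Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine,
    # with wavelengths forming a geometric progression up to 10000 * 2*pi.
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = np.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = np.cos(angle)
    return pe   # shape: (d_model,)
```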

4.3 Beam Search Decoding (for sequence generation)

Beam search is a strategy to decode the most likely output sequence from a language model by keeping track of multiple hypotheses. Unlike greedy decoding (which picks 1 best token at each step), beam search keeps B hypotheses at each step (beam width = B). Below is a simplified beam search pseudocode for an autoregressive LLM:

def beam_search_generate(model, input_ids, beam_width=3, max_steps=50):
    # Start with the initial sequence (the prompt/input) for each beam
    beams = [(input_ids, 0.0)]  # list of tuples: (current_sequence, log_probability)
    completed_sequences = [ ]
    for step in range(max_steps):
        new_beams = [ ]
        for seq, logp in beams:
            # If a beam already ended with the EOS token, move it to the completed list
            if seq[-1] == EOS_token_id:
                completed_sequences.append((seq, logp))
                continue
            # 1. Run model to get next-token probabilities
            logits = model(seq)                # get logits for next token (shape: [vocab_size])
            probs = softmax(logits)           # convert to probabilities
            # 2. Get the top B token candidates for expansion
            top_tokens = argsort(probs)[-beam_width:]  # indices of the B highest probabilities
            for token in top_tokens:
                new_seq = seq + [token]
                new_logp = logp + log(probs[token] + 1e-9)
                new_beams.append((new_seq, new_logp))
        # 3. Prune to keep only top B sequences by log probability
        new_beams.sort(key=lambda x: x[1], reverse=True)
        beams = new_beams[:beam_width]
        # If all beams are completed (ended with EOS), we can break early
        if all(seq[-1] == EOS_token_id for seq, _ in beams):
            break
    # Combine still-active beams into completed_sequences
    completed_sequences.extend(beams)
    # Return the sequence with the highest log probability
    best_seq, best_score = max(completed_sequences, key=lambda x: x[1])
    return best_seq

Explanation: We maintain a list of up to beam_width sequences at each step. Initially, it’s just the prompt (with its log-probability 0.0). At each step, each beam sequence is expanded by all possible next tokens – in practice we take the top B expansions for each to limit branching. That gives up to B * B new sequences, from which we select the top B overall (this is the beam pruning). We continue until we reach max_steps or until all beams have produced an end-of-sequence (EOS) token. Finally, we output the highest-probability sequence. Log probabilities (logp) are used so that probabilities can be combined by summation in a numerically stable way (multiplying many probabilities can underflow). We also often apply a length normalization to avoid longer beams getting unfairly low scores from multiplying many probabilities (Decoding Strategies in Large Language Models) – that detail is omitted for simplicity, but see the sketch below. Beam search explores multiple hypotheses and can significantly improve results in tasks like translation or summarization, where greedily selecting the locally best token can lead to suboptimal overall output (Decoding Strategies in Large Language Models). The downside is higher computation (beam width = 3 means roughly 3× more sequences to evaluate than greedy decoding) and the possibility of the model preferring generic high-probability outputs (beam search can sometimes favor safe, dull phrases). In practice, many implementations add tweaks like a coverage penalty or a diversity penalty when using beams. But the above core algorithm covers the main idea (Decoding Strategies in Large Language Models): at each step keep the best B sequences.
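
As a small illustration of the length-normalization detail mentioned above, one common variant divides a beam’s cumulative log-probability by the number of generated tokens raised to a penalty exponent (the function name and the alpha value here are illustrative):

```python
def normalized_score(seq, logp, prompt_len, alpha=0.7):
    # Average (length-penalized) log-probability, so longer hypotheses are not
    # unfairly penalized simply for having more probability factors < 1.
    gen_len = max(1, len(seq) - prompt_len)   # count only generated tokens
    return logp / (gen_len ** alpha)

# In beam_search_generate, the pruning step would then sort by this score, e.g.:
# new_beams.sort(key=lambda b: normalized_score(b[0], b[1], len(input_ids)), reverse=True)
```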

4.4 Nucleus Sampling (Top-p Sampling) Decoding

Nucleus sampling is a stochastic decoding method that chooses the next token from the smallest set of tokens that together have probability mass ≥ p (the “nucleus”). It’s also known as top-p sampling. Here’s pseudocode for one step of nucleus sampling (in practice, you repeat this in a loop for each generated token):

def nucleus_sample_next_token(logits, p=0.9):
    probs = softmax(logits)                      # convert logits to a probability distribution
    sorted_indices = numpy.argsort(probs)[::-1]  # token indices sorted by probability, descending
    sorted_probs = probs[sorted_indices]
    # Compute cumulative probabilities in sorted order
    cum_probs = numpy.cumsum(sorted_probs)
    # Find the cutoff index where the cumulative probability first reaches p
    cutoff_index = 0
    while cutoff_index < len(cum_probs) and cum_probs[cutoff_index] < p:
        cutoff_index += 1
    # cum_probs[cutoff_index] is now >= p (or we reached the end of the distribution)
    # Define the nucleus set of tokens (indices 0 .. cutoff_index inclusive in sorted order)
    nucleus_indices = sorted_indices[:cutoff_index + 1]
    nucleus_probs = probs[nucleus_indices]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()   # re-normalize within the nucleus
    # Randomly sample one token from the nucleus according to the renormalized probabilities
    next_token = numpy.random.choice(nucleus_indices, p=nucleus_probs)
    return next_token

Explanation: We first sort tokens by probability. For example, suppose the top probabilities are [0.5, 0.2, 0.1, 0.05, ...]. We then accumulate until we have, say, ≥90% total probability. If $p=0.9$, we include the most likely tokens until their sum ≥ 0.9. We then randomly select the next token from this set. This way, we limit sampling to a probable set of tokens, excluding the long tail of very unlikely tokens (which can produce incoherent text). But unlike top-k sampling (which would pick the top k tokens by probability), nucleus sampling adapts the cutoff based on the distribution shape – sometimes only a few tokens have 90% of the mass (then it behaves like picking among the top 2 or 3), sometimes the distribution is flatter (then more tokens are included). The pseudocode uses a cumulative sum to find that cutoff index. The numpy.random.choice call weighted by nucleus_probs then makes a random pick. This randomness means each run can produce a different continuation, which is desirable for open-ended generation (creative or varied responses). In production, one typically also adds a temperature parameter to flatten or sharpen the distribution before sampling (e.g. probs = softmax(logits / T) for some temperature T). Setting $T < 1$ makes the output more deterministic, while $T > 1$ makes it more random. Nucleus sampling with $p \approx 0.9$–0.95, often combined with $T \in [0.7, 1.0]$, is a common choice for chatbots and text generators to balance coherence and creativity.

(Reference: The above algorithm is essentially described in Algorithm 1 of Holtzman et al. (2020), who introduced nucleus sampling.)
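
For completeness, here is a sketch of how the single-step function above might be used in a generation loop, with an optional temperature applied to the logits before sampling (model, EOS_token_id, and max_new_tokens are placeholders, as in the earlier pseudocode):

```python
def nucleus_generate(model, input_ids, p=0.9, temperature=0.8, max_new_tokens=50):
    seq = list(input_ids)
    for _ in range(max_new_tokens):
        logits = model(seq)                                    # logits for the next token
        next_token = nucleus_sample_next_token(logits / temperature, p=p)
        seq.append(int(next_token))
        if next_token == EOS_token_id:                         # stop at end-of-sequence
            break
    return seq
```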

4.5 Optimization: Efficient Autoregressive Inference with KV Caching

During autoregressive generation (as with GPT or the T5 decoder), a naive implementation would rerun the entire Transformer on the whole sequence each time a new token is added, which is redundant. A key optimization is caching the Key and Value matrices from the self-attention computation for past tokens (Best Practices for Generation with Cache). This way, at each new generation step, the model only computes attention for the new token against the past keys and values, rather than recomputing attention among past tokens themselves. We’ll outline this process:

# We maintain cache of past K and V for each layer across time steps.
# Assume we generate tokens one by one in a loop.
K_cache = [ [] for _ in range(num_layers) ]
V_cache = [ [] for _ in range(num_layers) ]
generated_sequence = [ ]
last_generated_token_embedding = E[start_token_id]  # e.g. the last prompt token (the prompt would first be run through the model to fill the caches)

for t in range(total_tokens_to_generate):
    x_t = last_generated_token_embedding  # embedding for current token (d_model,)
    for ℓ in range(num_layers):
        # Layer ℓ - Self-Attention with caching
        q_t = x_t @ W^Q_ℓ              # (1, d_k)
        k_t = x_t @ W^K_ℓ              # (1, d_k)
        v_t = x_t @ W^V_ℓ              # (1, d_v)
        # Append new key, value to cache for this layer
        K_cache[ℓ].append(k_t)        # K_cache[ℓ] becomes shape (t, d_k) after appending
        V_cache[ℓ].append(v_t)        # V_cache[ℓ] becomes shape (t, d_v)
        # Compute attention of q_t over all cached keys (including itself)
        keys = stack(K_cache[ℓ])      # shape (t, d_k)
        values = stack(V_cache[ℓ])    # shape (t, d_v)
        attn_scores = (q_t @ keys.T) / sqrt(d_k)   # (1, t)
        attn_weights = softmax(attn_scores)        # (1, t)
        context_t = attn_weights @ values          # (1, d_v)
        # Output of attention + residual connection
        x_t = x_t + (context_t @ W^O_ℓ)            # project if multi-head combined, add residual
        # Layer norm and Feed-Forward
        x_t = LayerNorm(x_t)                      
        x_t = x_t + FFN_ℓ(x_t)                    # apply feed-forward and residual
        x_t = LayerNorm(x_t)
    # After final layer ℓ, x_t is the output representation for the new token
    logits = x_t @ W_vocab^T         # (1, vocab_size)
    next_token = sample_from_logits(logits)  # use e.g. greedy or sampling to choose token
    generated_sequence.append(next_token)
    last_generated_token_embedding = E[next_token]  # get embedding of the newly generated token

Explanation: We iterate, generating one token at a time. For each new token $x_t$ (starting from the user-provided prompt or a special start token), each Transformer layer ℓ reuses the past keys and values. K_cache[ℓ] and V_cache[ℓ] grow over time. At step $t$, keys is effectively the matrix $K_{1:t}$ (all keys from tokens 1 to $t$), and similarly for values. We compute attention of the current token’s query $q_t$ against those keys (Best Practices for Generation with Cache). This yields an output context_t, which is the attention result for the new token. We add the usual residual connection (the code above shows it as adding to x_t, which at that point still holds the token’s embedding or the output of the previous layer – in practice, one might separate the residual add). Then we normalize and feed this through the FFN, etc. The cache means we never recompute keys or values for old tokens; we just fetch them. Without caching, generating an $n$-token output requires $n$ forward passes that each reprocess the entire prefix, i.e. O($n^2$) token-level computations. With caching, each of the $n$ passes processes only the new token while attending over the growing history – the attention dot-products are still O($n^2$) overall, but with a much smaller constant, because the heavy transforms (computing keys/values and running the FFN for each token through all layers) are not redone for past tokens (Best Practices for Generation with Cache). In practice, this yields large speedups for long sequences.

All major LLM implementations use caching for autoregressive generation (Best Practices for Generation with Cache). For example, in Hugging Face Transformers the model’s forward pass returns past_key_values and accepts them on the next call (and generate() uses this cache internally when use_cache=True), implementing exactly this strategy. The KV cache grows until it hits the model’s maximum context length. One can also drop the oldest tokens if a sliding window is used (some streaming approaches do this), but standard practice is simply to limit generation to the maximum length.
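
As a concrete sketch of this pattern with the Hugging Face API (the gpt2 checkpoint and greedy selection are illustrative; generate() normally handles all of this internally):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("I have a dream", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, use_cache=True)                 # full pass over the prompt
    past = out.past_key_values                             # cached K/V for every layer
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated = [next_id]
    for _ in range(20):
        # Feed ONLY the new token; keys/values for all earlier tokens come from the cache.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```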

Why it’s efficient: As an illustration, suppose we have a 24-layer model and we want to generate 100 tokens. Without the cache, we would perform 100 forward passes of 24 layers over sequences of increasing length (average length ~50) → roughly 24 × (sum from 1 to 100) = 24 × 5050 = 121,200 “layer computations”. With the cache, we do 100 passes of 24 layers, each with mostly constant cost except the dot-products with past keys, which grow linearly – the total work is roughly 24 × 5050 attention dot-products plus 24 × 100 = 2,400 per-token key/value projections and FFN evaluations. So complexity-wise it is still O(n^2) in time, but the constant factor is much smaller, because computing new keys/values and running the FFN (which dominate compute) are O(n) instead of O(n^2). In practice this means the model can generate with almost constant time per token after an initial cost. Also, memory usage stays modest because in inference we aren’t backpropagating (no intermediate activations need to be stored for gradients), and the cache itself is reused across steps.

In the code above, stack(K_cache[ℓ]) simply builds a matrix of shape (t, d_k) from all past keys. In a real implementation we’d store the cache in a single tensor and append to it (rather than using a Python list). The attention mask is adjusted accordingly to include the past tokens (Best Practices for Generation with Cache) (as noted: ensure dimensions match when concatenating past and present). The result is functionally equivalent to recomputing self-attention from scratch on the whole sequence each time, but much faster (Best Practices for Generation with Cache).

Other Inference Optimizations: Besides caching, other techniques include:

  • Mixed precision: Using FP16/BF16 weights instead of FP32 to halve memory and often nearly double speed, with minimal impact on accuracy (a minimal loading sketch follows this list).
  • Batching: Many serving implementations batch multiple sequences (generation requests) together to better utilize the hardware’s parallelism (though this can add some latency compared to serving a single request alone).
  • Quantization: Converting model weights to int8 or int4 and performing calculations in lower precision can dramatically speed up inference on CPUs and some GPUs. For example, int8 quantization can accelerate inference by roughly 2–4× with a small accuracy drop. This often requires calibration or advanced quantization algorithms to maintain quality.
  • Operator fusion and optimized kernels: Merging steps (like computing Q,K,V in one fused kernel, or fusing LayerNorm and matrix-multiply) can reduce memory bandwidth and increase throughput. Highly optimized libraries (NVIDIA’s FasterTransformer, ONNX Runtime, etc.) employ these.
  • Distributed inference: For extremely large models (hundreds of billions of params), the model can be sharded across multiple GPUs and the inference computations distributed. Careful scheduling is needed to overlap communication and computation (e.g., pipeline parallelism). This doesn’t speed up per-token latency but enables deployment of models that wouldn’t fit on a single device.
  • Prefix prompts optimization: If a long prompt is given, one can cache the prompt’s final activations and reuse them for different continuations rather than re-running the whole prompt for each different continuation (useful in certain applications). This is related to caching as well.
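
As a minimal illustration of the mixed-precision point above (the gpt2 checkpoint is illustrative; bfloat16 can be used instead on hardware that supports it, and a CUDA GPU is assumed to be available):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Load the weights directly in half precision: roughly half the memory of FP32,
# and usually faster matrix multiplies on modern GPUs.
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("I have a dream", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```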

All these optimizations ensure that despite LLMs being very large, we can use them in practical scenarios. For instance, with caching and mixed precision, a ~175B-parameter model such as GPT-3 can generate a reasonable length of text in a matter of seconds on modern hardware, whereas it would be orders of magnitude slower without such optimizations.

Prepared with OpenAI o1-pro & deep-research