Concept
A Transformer decoder is a function that takes a sequence of tokens and produces, for each position, a probability distribution over the next token. Stack enough layers, train on enough text, and you get GPT.
For us, the model is small on purpose: a handful of attention heads, a couple of blocks, a 64-character alphabet. You can fit the whole thing on a screen and watch every multiplication.
The pipeline you'll meet across the next six sections:
- Token + positional embeddings turn each character into a vector.
- Masked self-attention lets each position look back at the earlier ones (never forwards — that's the "decoder" / causal part).
- A feed-forward network processes each position independently.
- Residual connections + LayerNorm keep training stable.
- Stacking repeats steps 2–4 (attention, feed-forward, residuals and norms) over and over.
- Sampling picks the next token from the final logits.
The point of this site is that you don't have to take any of that on trust. Every operation has a widget you can poke at.
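As a taste of the final step, here is what "sampling picks the next token" can look like in code. This is a minimal TypeScript sketch for illustration, not the site's actual sampler; both function names are made up:

// Softmax turns one row of logits into a probability distribution.
function softmaxRow(logits: number[]): number[] {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Sample an index from that distribution; temperature 0 degenerates to argmax.
function sampleNextToken(logits: number[], temperature = 1.0): number {
  if (temperature === 0) {
    return logits.indexOf(Math.max(...logits)); // greedy decoding
  }
  const probs = softmaxRow(logits.map((l) => l / temperature));
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i]!;
    if (r <= 0) return i;
  }
  return probs.length - 1; // guard against floating-point round-off
}

Lower temperatures sharpen the distribution towards the argmax; higher ones flatten it towards uniform.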
Maths
The decoder defines a function f : Z^S → R^(S×V), where S is the sequence length and V is the vocabulary size. For each position i, f(x)_i is a vector of unnormalised log-probabilities — "logits" — over the next-token distribution: softmax(f(x)_i) is the model's predicted distribution for the token at position i + 1. With the 64-character alphabet from the Concept section and S = 8, f(x) is an 8×64 matrix.
Internally, f is built from N identical blocks, each composed of a masked multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, both wrapped in a residual connection and a LayerNorm:

h        = x + Attn(LN_1(x))   // residual around attention
Block(x) = h + FFN(LN_2(h))    // residual around the FFN
The pre-norm placement — LayerNorm before each sub-layer — is the choice GPT-2 made; it trains more stably than the post-norm arrangement of the original "Attention Is All You Need" paper.
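Transcribed into TypeScript, one block is just those two lines. The sketch below is illustrative rather than the repo's actual block function; attn, ffn, and the BlockWeights shape are assumptions, declared as stand-ins for the real helpers under src/lib/transformer/:

type Mat = number[][];

// Stand-in signatures for the real helpers (attn and ffn are assumed names).
declare function attn(x: Mat, w: unknown, nHeads: number): Mat;
declare function ffn(x: Mat, w: unknown): Mat;
declare function layernormRows(x: Mat, gamma: number[], beta: number[]): Mat;
declare function addMat(a: Mat, b: Mat): Mat;

interface BlockWeights {
  ln1: { gamma: number[]; beta: number[] };
  ln2: { gamma: number[]; beta: number[] };
  attn: unknown; // per-head Q/K/V/output projections, elided here
  ffn: unknown;  // the two feed-forward matrices, elided here
}

function preNormBlock(x: Mat, w: BlockWeights, nHeads: number): Mat {
  // h = x + Attn(LN_1(x))
  const h = addMat(x, attn(layernormRows(x, w.ln1.gamma, w.ln1.beta), w.attn, nHeads));
  // Block(x) = h + FFN(LN_2(h))
  return addMat(h, ffn(layernormRows(h, w.ln2.gamma, w.ln2.beta), w.ffn));
}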
Code
The TypeScript that actually runs:
// src/lib/transformer/model.ts (simplified)
export function forwardTyped(tokenIds, config, weights) {
  // 1. token + positional embeddings
  const tokEmb = tokenEmbedding(tokenIds, weights.tok_emb);
  const posEmb = positionalEncoding(config.seq_len, config.d_model);
  let x = addMat(tokEmb, posEmb);
  // 2. n_blocks pre-norm attention/FFN blocks
  for (let i = 0; i < config.n_blocks; i++) {
    x = block(x, weights.blocks[i]!, config.n_heads);
  }
  // 3. final LayerNorm, then project back onto the vocabulary
  const xFinal = layernormRows(
    x,
    weights.ln_final.gamma,
    weights.ln_final.beta,
  );
  const logits = matmul(xFinal, transpose(weights.tok_emb)); // tied head
  return { tokEmb, posEmb, xFinal, logits };
}
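Note the last line: the output head reuses the token-embedding matrix, transposed, rather than learning a separate projection. That weight tying is what the "tied head" comment refers to, and it keeps the parameter count down in a model this small.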
Each helper (tokenEmbedding, block, layernormRows, matmul,
transpose) lives in its own file under src/lib/transformer/. Read them
in the order the function calls them and the whole model fits in a sitting.
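Everything above is a single forward pass. To generate text you call it in a loop, feeding each sampled token back in. The driver below is hypothetical glue, not repo code, assuming the forwardTyped signature above and a sampler like the sampleNextToken sketch from the Concept section:

// Hypothetical driver: autoregressive generation with forwardTyped.
function generate(
  prompt: number[],
  steps: number,
  config: { seq_len: number; d_model: number; n_blocks: number; n_heads: number },
  weights: unknown,
): number[] {
  const out = [...prompt];
  for (let s = 0; s < steps; s++) {
    const context = out.slice(-config.seq_len); // keep only the last seq_len tokens
    const { logits } = forwardTyped(context, config, weights);
    const lastRow = logits[logits.length - 1]!; // logits for the next token
    out.push(sampleNextToken(lastRow));
  }
  return out;
}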