Concept
A Transformer decoder is a function that takes a sequence of tokens and produces, for each position, a probability distribution over the next token. Stack enough layers, train on enough text, and you get GPT.
For us, the model is small on purpose: a handful of attention heads, a couple of blocks, a 64-character alphabet. You can fit the whole thing on a screen and watch every multiplication.
The pipeline you'll meet across the next six sections:
- Token + positional embeddings turn each character into a vector.
- Masked self-attention lets each position look back at the earlier ones (never forwards — that's the "decoder" / causal part).
- A feed-forward network processes each position independently.
- Residual connections + LayerNorm keep training stable.
- Stacking repeats steps 2–4 (attention, feed-forward, residuals and norms) over and over.
- Sampling picks the next token from the final logits.
The point of this site is that you don't have to take any of that on trust. Every operation has a widget you can poke at.
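As a taste of the final step, here is what "sampling picks the next token" can look like in code. This is a minimal TypeScript sketch for illustration, not the site's actual sampler; both function names are made up:

// Softmax turns one row of logits into a probability distribution.
function softmaxRow(logits: number[]): number[] {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Sample an index from that distribution; temperature 0 degenerates to argmax.
function sampleNextToken(logits: number[], temperature = 1.0): number {
  if (temperature === 0) {
    return logits.indexOf(Math.max(...logits)); // greedy decoding
  }
  const probs = softmaxRow(logits.map((l) => l / temperature));
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i]!;
    if (r <= 0) return i;
  }
  return probs.length - 1; // guard against floating-point round-off
}

Lower temperatures sharpen the distribution towards the argmax; higher ones flatten it towards uniform.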
Maths
The decoder defines a function f : Z^S → R^(S×V), where S is the sequence length and V is the vocabulary size. For each position i, f(x)_i is a vector of unnormalised log-probabilities — "logits" — over the next-token distribution: softmax(f(x)_i) is the model's predicted distribution for the token at position i + 1. With the 64-character alphabet from the Concept section and S = 8, f(x) is an 8×64 matrix.
Internally, f is built from N identical blocks, each composed of a masked multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, both wrapped in a residual connection and a LayerNorm:

h        = x + Attn(LN_1(x))   // residual around attention
Block(x) = h + FFN(LN_2(h))    // residual around the FFN
The pre-norm placement — LayerNorm before each sub-layer — is the choice GPT-2 made; it trains more stably than the post-norm arrangement of the original "Attention Is All You Need" paper.
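Transcribed into TypeScript, one block is just those two lines. The sketch below is illustrative rather than the repo's actual block function; attn, ffn, and the BlockWeights shape are assumptions, declared as stand-ins for the real helpers under src/lib/transformer/:

type Mat = number[][];

// Stand-in signatures for the real helpers (attn and ffn are assumed names).
declare function attn(x: Mat, w: unknown, nHeads: number): Mat;
declare function ffn(x: Mat, w: unknown): Mat;
declare function layernormRows(x: Mat, gamma: number[], beta: number[]): Mat;
declare function addMat(a: Mat, b: Mat): Mat;

interface BlockWeights {
  ln1: { gamma: number[]; beta: number[] };
  ln2: { gamma: number[]; beta: number[] };
  attn: unknown; // per-head Q/K/V/output projections, elided here
  ffn: unknown;  // the two feed-forward matrices, elided here
}

function preNormBlock(x: Mat, w: BlockWeights, nHeads: number): Mat {
  // h = x + Attn(LN_1(x))
  const h = addMat(x, attn(layernormRows(x, w.ln1.gamma, w.ln1.beta), w.attn, nHeads));
  // Block(x) = h + FFN(LN_2(h))
  return addMat(h, ffn(layernormRows(h, w.ln2.gamma, w.ln2.beta), w.ffn));
}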
Code
The TypeScript that actually runs:
// src/lib/transformer/model.ts (simplified)
export function forwardTyped(tokenIds, config, weights) {
  // 1. token + positional embeddings
  const tokEmb = tokenEmbedding(tokenIds, weights.tok_emb);
  const posEmb = positionalEncoding(config.seq_len, config.d_model);
  let x = addMat(tokEmb, posEmb);
  // 2. n_blocks pre-norm attention/FFN blocks
  for (let i = 0; i < config.n_blocks; i++) {
    x = block(x, weights.blocks[i]!, config.n_heads);
  }
  // 3. final LayerNorm, then project back onto the vocabulary
  const xFinal = layernormRows(
    x,
    weights.ln_final.gamma,
    weights.ln_final.beta,
  );
  const logits = matmul(xFinal, transpose(weights.tok_emb)); // tied head
  return { tokEmb, posEmb, xFinal, logits };
}
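Note the last line: the output head reuses the token-embedding matrix, transposed, rather than learning a separate projection. That weight tying is what the "tied head" comment refers to, and it keeps the parameter count down in a model this small.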
Each helper (tokenEmbedding, block, layernormRows, matmul,
transpose) lives in its own file under src/lib/transformer/. Read them
in the order the function calls them and the whole model fits in a sitting.
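Everything above is a single forward pass. To generate text you call it in a loop, feeding each sampled token back in. The driver below is hypothetical glue, not repo code, assuming the forwardTyped signature above and a sampler like the sampleNextToken sketch from the Concept section:

// Hypothetical driver: autoregressive generation with forwardTyped.
function generate(
  prompt: number[],
  steps: number,
  config: { seq_len: number; d_model: number; n_blocks: number; n_heads: number },
  weights: unknown,
): number[] {
  const out = [...prompt];
  for (let s = 0; s < steps; s++) {
    const context = out.slice(-config.seq_len); // keep only the last seq_len tokens
    const { logits } = forwardTyped(context, config, weights);
    const lastRow = logits[logits.length - 1]!; // logits for the next token
    out.push(sampleNextToken(lastRow));
  }
  return out;
}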