A tiny GPT you can train right in your browser
A language model predicts what comes next in a sequence. Given the letters "mar", it might predict "i" (for "maria") or "k" (for "mark"). This model works at the character level: each token is a single letter.
The tokenizer converts text into numbers the model can process. Each unique character (a–z) gets an ID. A special BOS (Beginning Of Sequence) token marks the start and end of each name.
```javascript
// Build vocabulary: each unique character gets an ID
const charSet = new Set();
for (const doc of docs) for (const ch of doc) charSet.add(ch);
uchars = Array.from(charSet).sort();

// BOS token ID = number of characters
BOS = uchars.length;
vocabSize = uchars.length + 1;

// Tokenize: "emma" → [BOS, 4, 12, 12, 0, BOS]
function tokenize(doc) {
  const tokens = [BOS];
  for (const ch of doc) tokens.push(uchars.indexOf(ch));
  tokens.push(BOS);
  return tokens;
}
```
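To see the round trip, here is a self-contained sketch with a tiny stand-in `docs` list (so the IDs differ from the full a–z example above) and a hypothetical `detokenize` helper that inverts the mapping:

```javascript
// Tiny stand-in training data for illustration
const docs = ["emma", "mark"];

const charSet = new Set();
for (const doc of docs) for (const ch of doc) charSet.add(ch);
const uchars = Array.from(charSet).sort();
const BOS = uchars.length;

function tokenize(doc) {
  const tokens = [BOS];
  for (const ch of doc) tokens.push(uchars.indexOf(ch));
  tokens.push(BOS);
  return tokens;
}

// Hypothetical inverse: drop BOS markers, map IDs back to characters
function detokenize(tokens) {
  return tokens.filter(t => t !== BOS).map(t => uchars[t]).join("");
}

console.log(detokenize(tokenize("emma"))); // "emma"
```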
Neural networks learn by computing gradients — how much each weight contributed to the error. The Value class wraps every number in the model, recording every operation in a computation graph. Calling backward() walks this graph in reverse, applying the chain rule to compute all gradients automatically.
```javascript
class Value {
  constructor(data, children = [], localGrads = []) {
    this.data = data;          // scalar value
    this.grad = 0;             // gradient (filled by backward)
    this._children = children;
    this._localGrads = localGrads;
  }

  // Each operation records the local derivative
  add(other) {
    // d(a+b)/da = 1, d(a+b)/db = 1
    return new Value(this.data + other.data, [this, other], [1, 1]);
  }

  mul(other) {
    // d(a*b)/da = b, d(a*b)/db = a
    return new Value(this.data * other.data, [this, other], [other.data, this.data]);
  }

  backward() {
    // 1. Topological sort (children before parents)
    const topo = [];
    const visited = new Set();
    const build = (v) => {
      if (visited.has(v)) return;
      visited.add(v);
      for (const child of v._children) build(child);
      topo.push(v);
    };
    build(this);

    // 2. Walk in reverse, applying the chain rule:
    //    child.grad += localGrad * this.grad
    this.grad = 1;
    for (const v of topo.reverse()) {
      v._children.forEach((child, i) => {
        child.grad += v._localGrads[i] * v.grad;
      });
    }
  }
}
```
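To see the autograd engine in action, here is a compact, self-contained variant of `Value` with a worked gradient check; the function f(a, b) = a·b + a is illustrative, chosen because its derivatives (df/da = b + 1, df/db = a) are easy to verify by hand:

```javascript
class Value {
  constructor(data, children = [], localGrads = []) {
    this.data = data;
    this.grad = 0;
    this._children = children;
    this._localGrads = localGrads;
  }
  add(o) { return new Value(this.data + o.data, [this, o], [1, 1]); }
  mul(o) { return new Value(this.data * o.data, [this, o], [o.data, this.data]); }
  backward() {
    const topo = [], seen = new Set();
    const build = (v) => {
      if (seen.has(v)) return;
      seen.add(v);
      v._children.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse()) {
      v._children.forEach((c, i) => { c.grad += v._localGrads[i] * v.grad; });
    }
  }
}

// f(a, b) = a*b + a  →  df/da = b + 1, df/db = a
const a = new Value(2), b = new Value(3);
const f = a.mul(b).add(a);
f.backward();
console.log(a.grad, b.grad); // 4 2
```

Note that `a` appears twice in the graph (once in the product, once in the sum), which is why gradients accumulate with `+=` rather than plain assignment.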
The transformer is the architecture behind GPT. For each position, attention lets the model look at all previous positions and decide what information to gather. Multiple "heads" attend to different patterns simultaneously (e.g., one head for adjacent letters, another for vowel patterns).
```javascript
// For each attention head:
for (let h = 0; h < N_HEAD; h++) {
  // Dot product of query with each key
  // → attention scores (how relevant is each position?)
  scores[t] = dotProduct(q_head, k_head[t]) / sqrt(HEAD_DIM);

  // Softmax → attention weights (sum to 1)
  attnWeights = softmax(scores);

  // Weighted sum of values → output
  // (gather information from attended positions)
  output[d] = sum(attnWeights[t] * values[t][d]);
}

// Also includes: residual connections, RMSNorm,
// and an MLP (feed-forward) block after attention
```
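The sketch above leans on `softmax` to turn raw scores into weights. A standard, numerically stable implementation (on plain numbers rather than `Value` objects, for clarity):

```javascript
function softmax(scores) {
  // Subtract the max so exp() never overflows for large scores
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const sum = exps.reduce((acc, e) => acc + e, 0);
  return exps.map(e => e / sum);
}

const weights = softmax([1, 2, 3]);
// weights sum to 1; the largest score gets the largest weight
```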
Each training step: (1) forward pass to get predictions, (2) compute cross-entropy loss (how wrong the predictions are), (3) backward pass to get gradients, (4) Adam optimizer updates weights. Adam adapts the learning rate per-parameter using running averages of gradients.
```javascript
for (let step = 0; step < numSteps; step++) {
  // Forward: predict next character at each position
  for (let pos = 0; pos < n; pos++) {
    logits = gpt(tokens[pos], pos, keys, values);
    probs = softmax(logits);
    loss += -log(probs[target]);  // cross-entropy
  }

  loss.backward();  // compute all gradients

  // Adam update: adaptive learning rate per parameter
  for (p of params) {
    m[i] = beta1 * m[i] + (1-beta1) * p.grad;     // momentum
    v[i] = beta2 * v[i] + (1-beta2) * p.grad**2;  // velocity
    p.data -= lr * mHat / (sqrt(vHat) + eps);
    p.grad = 0;  // reset for next step
  }
}
```
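The `mHat` and `vHat` in the loop above are bias-corrected versions of `m` and `v` (they compensate for both averages starting at zero). A self-contained sketch of the same Adam update, minimizing the toy function f(x) = x² (gradient 2x) with typical hyperparameters, so x should move from 5 toward 0:

```javascript
const lr = 0.1, beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
let x = 5, m = 0, v = 0;

for (let t = 1; t <= 200; t++) {
  const grad = 2 * x;                         // df/dx for f(x) = x^2
  m = beta1 * m + (1 - beta1) * grad;         // first moment (momentum)
  v = beta2 * v + (1 - beta2) * grad * grad;  // second moment
  const mHat = m / (1 - beta1 ** t);          // bias correction
  const vHat = v / (1 - beta2 ** t);
  x -= lr * mHat / (Math.sqrt(vHat) + eps);
}
console.log(x); // close to 0
```

Dividing by `sqrt(vHat)` is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps.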
To generate a name: start with the BOS token, predict the next character, sample from the distribution (controlled by temperature), feed it back in, and repeat until the model outputs BOS again (end signal).
```javascript
let token = BOS;  // start signal
for (let pos = 0; pos < BLOCK_SIZE; pos++) {
  logits = gpt(token, pos, keys, values);

  // Temperature: divide logits before softmax
  // Low temp → peaked distribution (conservative)
  // High temp → flat distribution (creative)
  scaled = logits.map(l => l / temperature);
  probs = softmax(scaled);

  token = weightedChoice(probs);  // sample next
  if (token === BOS) break;       // end signal
  name += uchars[token];
}
```
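`weightedChoice` picks an index in proportion to its probability. A standard inverse-CDF sketch (the optional `r` parameter is an addition here, purely to make the function testable with a fixed random draw):

```javascript
function weightedChoice(probs, r = Math.random()) {
  // Walk the cumulative distribution until the draw is used up
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```

With `probs = [0.2, 0.3, 0.5]`, draws below 0.2 return index 0, draws in [0.2, 0.5) return index 1, and the rest return index 2, matching the intended distribution.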
A structured path from the fundamentals to building your own GPT. Each lesson builds on the previous one.
A tiny GPT built from scratch in vanilla JavaScript. Inspired by Karpathy's microgpt.py. Train a character-level language model on names, then generate new ones.