A tiny GPT you can train right in your browser
A language model predicts what comes next in a sequence. Given the letters "mar", it might predict "i" (for "maria") or "k" (for "mark"). This model works at the character level: each token is a single letter.
The tokenizer converts text into numbers the model can process. Each unique character (a–z) gets an ID. A special BOS (Beginning Of Sequence) token marks the start and end of each name.
```javascript
// Build vocabulary: each unique character gets an ID
const charSet = new Set();
for (const doc of docs) for (const ch of doc) charSet.add(ch);
uchars = Array.from(charSet).sort();

// BOS token ID = number of characters
BOS = uchars.length;
vocabSize = uchars.length + 1;

// Tokenize: "emma" → [BOS, 4, 12, 12, 0, BOS]
function tokenize(doc) {
  const tokens = [BOS];
  for (const ch of doc) tokens.push(uchars.indexOf(ch));
  tokens.push(BOS);
  return tokens;
}
```
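To see the round trip, here is a self-contained sketch with a tiny stand-in `docs` list (so the IDs differ from the full a–z example above) and a hypothetical `detokenize` helper that inverts the mapping:

```javascript
// Tiny stand-in training data for illustration
const docs = ["emma", "mark"];

const charSet = new Set();
for (const doc of docs) for (const ch of doc) charSet.add(ch);
const uchars = Array.from(charSet).sort();
const BOS = uchars.length;

function tokenize(doc) {
  const tokens = [BOS];
  for (const ch of doc) tokens.push(uchars.indexOf(ch));
  tokens.push(BOS);
  return tokens;
}

// Hypothetical inverse: drop BOS markers, map IDs back to characters
function detokenize(tokens) {
  return tokens.filter(t => t !== BOS).map(t => uchars[t]).join("");
}

console.log(detokenize(tokenize("emma"))); // "emma"
```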
Neural networks learn by computing gradients — how much each weight contributed to the error. The Value class wraps every number in the model, recording every operation in a computation graph. Calling backward() walks this graph in reverse, applying the chain rule to compute all gradients automatically.
```javascript
class Value {
  constructor(data, children = [], localGrads = []) {
    this.data = data;          // scalar value
    this.grad = 0;             // gradient (filled by backward)
    this._children = children;
    this._localGrads = localGrads;
  }

  // Each operation records the local derivative
  add(other) {
    // d(a+b)/da = 1, d(a+b)/db = 1
    return new Value(this.data + other.data, [this, other], [1, 1]);
  }

  mul(other) {
    // d(a*b)/da = b, d(a*b)/db = a
    return new Value(this.data * other.data, [this, other], [other.data, this.data]);
  }

  backward() {
    // 1. Topological sort (children before parents)
    const topo = [];
    const visited = new Set();
    const build = (v) => {
      if (visited.has(v)) return;
      visited.add(v);
      for (const child of v._children) build(child);
      topo.push(v);
    };
    build(this);

    // 2. Walk in reverse, applying the chain rule:
    //    child.grad += localGrad * this.grad
    this.grad = 1;
    for (const v of topo.reverse()) {
      v._children.forEach((child, i) => {
        child.grad += v._localGrads[i] * v.grad;
      });
    }
  }
}
```
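To see the autograd engine in action, here is a compact, self-contained variant of `Value` with a worked gradient check; the function f(a, b) = a·b + a is illustrative, chosen because its derivatives (df/da = b + 1, df/db = a) are easy to verify by hand:

```javascript
class Value {
  constructor(data, children = [], localGrads = []) {
    this.data = data;
    this.grad = 0;
    this._children = children;
    this._localGrads = localGrads;
  }
  add(o) { return new Value(this.data + o.data, [this, o], [1, 1]); }
  mul(o) { return new Value(this.data * o.data, [this, o], [o.data, this.data]); }
  backward() {
    const topo = [], seen = new Set();
    const build = (v) => {
      if (seen.has(v)) return;
      seen.add(v);
      v._children.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse()) {
      v._children.forEach((c, i) => { c.grad += v._localGrads[i] * v.grad; });
    }
  }
}

// f(a, b) = a*b + a  →  df/da = b + 1, df/db = a
const a = new Value(2), b = new Value(3);
const f = a.mul(b).add(a);
f.backward();
console.log(a.grad, b.grad); // 4 2
```

Note that `a` appears twice in the graph (once in the product, once in the sum), which is why gradients accumulate with `+=` rather than plain assignment.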
The transformer is the architecture behind GPT. For each position, attention lets the model look at all previous positions and decide what information to gather. Multiple "heads" attend to different patterns simultaneously (e.g., one head for adjacent letters, another for vowel patterns).
```javascript
// For each attention head:
for (let h = 0; h < N_HEAD; h++) {
  // Dot product of query with each key
  // → attention scores (how relevant is each position?)
  scores[t] = dotProduct(q_head, k_head[t]) / sqrt(HEAD_DIM);

  // Softmax → attention weights (sum to 1)
  attnWeights = softmax(scores);

  // Weighted sum of values → output
  // (gather information from attended positions)
  output[d] = sum(attnWeights[t] * values[t][d]);
}

// Also includes: residual connections, RMSNorm,
// and an MLP (feed-forward) block after attention
```
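The sketch above leans on `softmax` to turn raw scores into weights. A standard, numerically stable implementation (on plain numbers rather than `Value` objects, for clarity):

```javascript
function softmax(scores) {
  // Subtract the max so exp() never overflows for large scores
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const sum = exps.reduce((acc, e) => acc + e, 0);
  return exps.map(e => e / sum);
}

const weights = softmax([1, 2, 3]);
// weights sum to 1; the largest score gets the largest weight
```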
Each training step: (1) forward pass to get predictions, (2) compute cross-entropy loss (how wrong the predictions are), (3) backward pass to get gradients, (4) Adam optimizer updates weights. Adam adapts the learning rate per-parameter using running averages of gradients.
```javascript
for (let step = 0; step < numSteps; step++) {
  // Forward: predict next character at each position
  for (let pos = 0; pos < n; pos++) {
    logits = gpt(tokens[pos], pos, keys, values);
    probs = softmax(logits);
    loss += -log(probs[target]);  // cross-entropy
  }

  loss.backward();  // compute all gradients

  // Adam update: adaptive learning rate per parameter
  for (p of params) {
    m[i] = beta1 * m[i] + (1-beta1) * p.grad;     // momentum
    v[i] = beta2 * v[i] + (1-beta2) * p.grad**2;  // velocity
    p.data -= lr * mHat / (sqrt(vHat) + eps);
    p.grad = 0;  // reset for next step
  }
}
```
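The `mHat` and `vHat` in the loop above are bias-corrected versions of `m` and `v` (they compensate for both averages starting at zero). A self-contained sketch of the same Adam update, minimizing the toy function f(x) = x² (gradient 2x) with typical hyperparameters, so x should move from 5 toward 0:

```javascript
const lr = 0.1, beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
let x = 5, m = 0, v = 0;

for (let t = 1; t <= 200; t++) {
  const grad = 2 * x;                         // df/dx for f(x) = x^2
  m = beta1 * m + (1 - beta1) * grad;         // first moment (momentum)
  v = beta2 * v + (1 - beta2) * grad * grad;  // second moment
  const mHat = m / (1 - beta1 ** t);          // bias correction
  const vHat = v / (1 - beta2 ** t);
  x -= lr * mHat / (Math.sqrt(vHat) + eps);
}
console.log(x); // close to 0
```

Dividing by `sqrt(vHat)` is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps.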
To generate a name: start with the BOS token, predict the next character, sample from the distribution (controlled by temperature), feed it back in, and repeat until the model outputs BOS again (end signal).
```javascript
let token = BOS;  // start signal
for (let pos = 0; pos < BLOCK_SIZE; pos++) {
  logits = gpt(token, pos, keys, values);

  // Temperature: divide logits before softmax
  // Low temp → peaked distribution (conservative)
  // High temp → flat distribution (creative)
  scaled = logits.map(l => l / temperature);
  probs = softmax(scaled);

  token = weightedChoice(probs);  // sample next
  if (token === BOS) break;       // end signal
  name += uchars[token];
}
```
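`weightedChoice` picks an index in proportion to its probability. A standard inverse-CDF sketch (the optional `r` parameter is an addition here, purely to make the function testable with a fixed random draw):

```javascript
function weightedChoice(probs, r = Math.random()) {
  // Walk the cumulative distribution until the draw is used up
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```

With `probs = [0.2, 0.3, 0.5]`, draws below 0.2 return index 0, draws in [0.2, 0.5) return index 1, and the rest return index 2, matching the intended distribution.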
A structured path from the fundamentals to building your own GPT. Each lesson builds on the previous one.
A tiny GPT built from scratch in vanilla JavaScript. Inspired by Karpathy's microgpt.py. Train a character-level language model on names, then generate new ones.