Karpathy's 200-line art project — fully explained with live animated diagrams. Start in plain English, switch to technical detail any time.
import os, math, random
random.seed(42)  # same seed = same sequence every run
Every later random call, including random.gauss (weight init), random.shuffle (data order), and random.choices (inference sampling), is fully deterministic from this point. Zero external dependencies: pure stdlib only, no numpy.
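A minimal sketch of what seeding buys, assuming nothing beyond the standard library. The helper name sample_run and the toy arguments (standard deviation, letters, weights) are illustrative and not taken from the script.

```python
import random

def sample_run(seed):
    """Replay the three kinds of random call after seeding the global generator."""
    random.seed(seed)
    weights = [random.gauss(0, 0.08) for _ in range(3)]           # weight-init style draw
    order = list("abcde")
    random.shuffle(order)                                         # data-order style shuffle
    pick = random.choices(range(3), weights=[0.2, 0.5, 0.3])[0]   # sampling style draw
    return weights, order, pick

first = sample_run(42)
second = sample_run(42)
print(first == second)  # identical seed, identical sequence of draws
```

Reseeding with the same value replays the whole sequence, so two runs can be compared number for number.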
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)  # break alphabetical order → prevent bias
line.strip() strips surrounding whitespace, and the "if line.strip()" filter drops empty lines via Python truthiness in the same comprehension. Training cycles through the data via docs[step % len(docs)]. 1000 steps / 32K docs ≈ 3% data coverage per run.
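The filtering, cycling, and coverage arithmetic can be sketched as follows. The list raw_lines is a hypothetical stand-in for the contents of input.txt, and num_steps = 7 is chosen only to show the index wrapping around.

```python
import random

random.seed(42)

# Stand-in for open('input.txt'): a few raw lines, some blank or padded.
raw_lines = ["emma\n", "\n", "  olivia \n", "ava\n", "   \n"]
docs = [line.strip() for line in raw_lines if line.strip()]
random.shuffle(docs)

# Cycle through the shuffled docs the way the training loop indexes them.
num_steps = 7
visited = [docs[step % len(docs)] for step in range(num_steps)]

# Coverage for the figures quoted above: 1000 steps over roughly 32,000 docs.
coverage = min(1000, 32_000) / 32_000
print(docs, visited, f"{coverage:.1%}")
```

With three documents the index wraps every third step. With 32K documents and 1000 steps it never wraps, so no document is seen twice.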
uchars = sorted(set(''.join(docs)))  # ['a','b',...,'z']
BOS = len(uchars)                    # id=26, special start/end signal
vocab_size = len(uchars) + 1         # 27 total symbols
Encode: uchars.index(ch). Decode: uchars[id]. BOS is dual-role (SOS+EOS), which reduces the vocabulary by 1 versus separate tokens. Training format: [BOS]+chars+[BOS]. No OOV tokens are possible: the vocabulary is closed at 27 symbols.
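A small round-trip sketch of the tokenizer described above. The function names encode and decode are illustrative wrappers, and the three-name corpus is a toy stand-in, so the vocabulary here has 8 symbols rather than 27.

```python
docs = ["emma", "olivia", "ava"]          # toy corpus, not the real dataset

uchars = sorted(set(''.join(docs)))       # unique characters, sorted
BOS = len(uchars)                         # one id past the last character
vocab_size = len(uchars) + 1

def encode(doc):
    """Wrap the character ids in BOS on both sides: [BOS] + chars + [BOS]."""
    return [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

def decode(ids):
    """Map ids back to characters, dropping the BOS markers."""
    return ''.join(uchars[i] for i in ids if i != BOS)

tokens = encode("emma")
print(uchars, tokens, decode(tokens))
```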
class Value:
    def __init__(self, data, children=()):
        self.data = data           # the actual number
        self.grad = 0              # how responsible is this number for the final error?
        self._children = children  # which numbers created this one?

a+b → both inputs receive the full upstream gradient (∂/∂a = 1, ∂/∂b = 1)
a*b → each input's blame depends on the other's value (∂/∂a = b, ∂/∂b = a)
__slots__ avoids per-instance __dict__ overhead (~200→64 bytes/node). Local grads captured as Python floats at forward time — second-order AD unsupported. += handles fan-out correctly: accumulates gradients from all downstream uses, implementing the multivariate chain rule.
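A minimal sketch of a node that records its local gradients at forward time, assuming only addition and multiplication. The attribute name _local_grads is an assumption made for this illustration.

```python
class Value:
    # __slots__ means instances carry no per-instance __dict__.
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children        # the inputs that produced this node
        self._local_grads = local_grads  # d(this)/d(input), stored as plain floats

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

a, b = Value(2.0), Value(3.0)
s = a + b
p = a * b
print(s.data, s._local_grads, p.data, p._local_grads)
```

Because the local gradients are plain floats rather than Value nodes, they are not part of the graph, which is why differentiating a second time is not possible in this design.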
self.grad = 1  # ∂loss/∂loss = 1 by definition (loss is 100% responsible for itself)
for v in reversed(topo):  # walk backward through the family tree
    # for each child of v, paired with its stored local gradient:
    child.grad += local_grad * v.grad  # chain rule: upstream_grad × local_grad → accumulate into child.grad
visited set prevents reprocessing shared nodes. += implements: ∂L/∂x = Σᵢ (∂L/∂yᵢ)(∂yᵢ/∂x) — multivariate chain rule for fan-out nodes.
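The backward pass can be sketched end to end on a tiny graph where one node is used three times. This is an illustrative reconstruction under the same assumptions as the description above (a visited set, a topological order, += accumulation). The names build and _local_grads are hypothetical.

```python
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()

        def build(v):
            # Depth-first walk; the visited set keeps shared nodes from being listed twice.
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad  # sum over every path into child

x = Value(3.0)
y = x * x + x      # x fans out into three uses
y.backward()
print(y.data, x.grad)  # dy/dx = 2x + 1
```

With x = 3 the analytic derivative 2x + 1 is 7, which is only recovered if the three contributions are added rather than overwritten.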
def relu(x): return max(0, x) # positive numbers pass through unchanged; negative numbers become 0
wte[27×16]  # lookup table: each letter → 16-number description
wpe[16×16]  # position table: each position → 16-number description
x = letter_description + position_description
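A sketch of the two lookups and their sum, using plain floats instead of autograd nodes. The helper names matrix and embed, and the init standard deviation of 0.08, are assumptions for illustration.

```python
import random

random.seed(42)
vocab_size, block_size, n_embd = 27, 16, 16

def matrix(nout, nin, std=0.08):
    """nout rows of nin Gaussian-initialised floats."""
    return [[random.gauss(0, std) for _ in range(nin)] for _ in range(nout)]

wte = matrix(vocab_size, n_embd)   # one row per token id
wpe = matrix(block_size, n_embd)   # one row per position

def embed(token_id, pos_id):
    """Element-wise sum of the token row and the position row."""
    return [t + p for t, p in zip(wte[token_id], wpe[pos_id])]

x = embed(token_id=4, pos_id=0)
print(len(x))
```

The same letter at two different positions gets two different vectors, which is how the model can tell "first m" from "second m".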
q = "what am I looking for from past letters?"
k = "what does each past letter advertise?"
v = "what is each past letter's actual content?"
attention = how well q matches each k → weighted mix of v
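A sketch of one attention head as standard scaled dot-product attention over plain floats. The function names and toy vectors are illustrative. Causality is assumed to come from the caller passing only the keys and values of positions seen so far, as a KV cache would.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (max-subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Score q against every key, then mix the values by those weights."""
    head_dim = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim) for k in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

q = [1.0, 0.0, 0.0, 0.0]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0]]
values = [[10.0, 0.0, 0.0, 0.0], [0.0, 10.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0]]
weights, out = attend(q, keys, values)
print(weights, out)
```

The key that points the same way as q gets the largest weight, so its value dominates the mix.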
original = x         # save a copy before processing
x = expand(x)        # zoom in: 16 numbers → 64 numbers (more room to think)
x = gate(x)          # apply on/off switches
x = compress(x)      # zoom back out: 64 → 16
x = x + original     # ← add the original back in (the skip connection)
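The expand, gate, compress, and skip steps can be sketched with plain floats. The names linear, mlp_block, fc1, and fc2 are hypothetical, the gate is taken to be ReLU, and the normalisation that precedes this block in the full model is left out for brevity.

```python
import random

random.seed(42)
n_embd = 16

def matrix(nout, nin, std=0.08):
    return [[random.gauss(0, std) for _ in range(nin)] for _ in range(nout)]

def linear(x, w):
    """Matrix-vector product: one dot product per output row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mlp_block(x, fc1, fc2):
    original = x                              # save a copy before processing
    h = linear(x, fc1)                        # expand: 16 -> 64
    h = [max(0.0, hi) for hi in h]            # gate: ReLU
    h = linear(h, fc2)                        # compress: 64 -> 16
    return [a + b for a, b in zip(h, original)]  # skip connection

fc1 = matrix(4 * n_embd, n_embd)
fc2 = matrix(n_embd, 4 * n_embd)
x = [random.gauss(0, 1) for _ in range(n_embd)]
y = mlp_block(x, fc1, fc2)
print(len(y))
```

If the compress weights were all zero, the block would return its input unchanged. That is the point of the skip connection: the layer only has to learn a correction.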
scores = model_output  # 27 scores, one per possible next letter
probs = scores_to_percentages(scores)
penalty = -log( probs[correct_letter] )
# Very confident about right answer → tiny penalty
# Totally clueless → huge penalty
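A sketch of the penalty as softmax followed by negative log-likelihood, on three hand-made score vectors. The function names and the score values are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log of the probability assigned to the correct token."""
    return -math.log(softmax(logits)[target])

clueless = [0.0] * 27                 # every letter equally likely
confident = [0.0] * 27
confident[3] = 10.0                   # strongly favours token 3

loss_clueless = cross_entropy(clueless, 3)   # equals log(27)
loss_right = cross_entropy(confident, 3)     # confident and correct
loss_wrong = cross_entropy(confident, 5)     # confident and wrong
print(loss_clueless, loss_right, loss_wrong)
```

A uniform guess over 27 symbols costs log(27), which is a useful baseline: a training loss above that value means the model is doing worse than chance.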
m[i] = 0.85*m[i] + 0.15*gradient   # running average of recent gradients (momentum)
v[i] = 0.99*v[i] + 0.01*gradient²  # running average of squared gradients (tracks gradient size)
weight -= step_size * direction / wildness  # step scaled by momentum / sqrt(variance), so larger for consistent gradients
weight.gradient = 0                # reset for next round
p.grad=0 is manual zero-grad — must precede next backward.
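A sketch of one Adam update with the β values quoted above, applied to a single toy weight. The function name adam_step, the bias-correction terms, eps, the base step size of 0.1, and the quadratic objective are assumptions made for illustration.

```python
import math

def adam_step(params, grads, m, v, step, lr, beta1=0.85, beta2=0.99, eps=1e-8):
    """Update params in place from grads, then zero the grads."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # running mean of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g ** 2   # running mean of squared gradients
        m_hat = m[i] / (1 - beta1 ** (step + 1))     # bias correction for the zero start
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        grads[i] = 0.0                               # manual zero-grad

# Toy problem: minimise (w - 3)^2 starting from w = 0.
target = 3.0
params, grads = [0.0], [0.0]
m, v = [0.0], [0.0]
num_steps = 200
for step in range(num_steps):
    grads[0] = 2 * (params[0] - target)              # analytic gradient
    lr_t = 0.1 * (1 - step / num_steps)              # linear decay to zero
    adam_step(params, grads, m, v, step, lr_t)
print(params[0])
```

On the very first step the corrected ratio is close to ±1, so the weight moves by roughly the step size regardless of how large the raw gradient is.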
for step in range(1000):
    name = pick_one_name()             # e.g. "emma"
    error = predict_and_measure(name)  # how wrong were we?
    blame = trace_backward(error)      # who caused it?
    nudge = adam_update(blame)         # fix the responsible numbers
step % len(docs) cycling — each document seen at most once across 1000 steps (3% data coverage). No gradient clipping, no regularisation — capacity limitation dominates at this scale.
probs = adjust_confidence(model_output, temperature)
next_letter = randomly_pick(probs)  # weighted by confidence
if next_letter == STOP_SIGNAL:
    break  # name is complete
| Temperature | Behaviour | Names you get |
|---|---|---|
| Low (0.1) | Almost always picks the most likely letter | Safe but boring, often repetitive |
| Medium (0.5) ← used here | Mostly picks likely letters, occasionally adventurous | Varied but still plausible |
| High (2.0) | Almost random picks | Creative but often nonsense |
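The effect of temperature in the table above can be sketched by dividing the scores by the temperature before the softmax. The function name and the three-score example are illustrative.

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Return a sampled index and the temperature-adjusted probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token = random.choices(range(len(probs)), weights=probs)[0]
    return token, probs

random.seed(42)
logits = [2.0, 1.0, 0.0]
_, cold = sample_with_temperature(logits, 0.1)
token, warm = sample_with_temperature(logits, 0.5)
_, hot = sample_with_temperature(logits, 2.0)
print(cold[0], warm[0], hot[0])
```

The probability of the top choice shrinks as the temperature rises, which is the move from "safe" to "creative" in the table.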
| Step | What happens | Think of it as... |
|---|---|---|
| Letters → Numbers | Convert text to something the computer can work with | Morse code |
| Give letters identity | Replace each number with a rich description | Name badge → full profile |
| Read the past | Decide which previous letters matter most right now | Spotlight on relevant history |
| Think | Process what was just read | The "digest and decide" step |
| Measure the mistake | How wrong was the prediction? | Teacher's red pen |
| Trace the blame | Which numbers caused the mistake? | Audit trail |
| Fix the numbers | Nudge everything slightly toward being less wrong | GPS rerouting after a wrong turn |
microgpt.py: 200 lines, zero dependencies
├── Tokenizer      char→int, BOS=26, vocab=27
├── Autograd       scalar DAG, reverse-mode AD
├── Architecture
│   ├── tok_emb[27×16] + pos_emb[16×16]
│   └── × 1 layer:
│       ├── Pre-norm (RMSNorm)
│       ├── Multi-head causal self-attn (4 heads, KV cache)
│       ├── Residual add
│       ├── Pre-norm + FFN: 16→64→ReLU→16
│       └── Residual add → lm_head[27×16]
├── Training       cross-entropy, teacher forcing, 1000 steps
├── Optimizer      Adam β₁=0.85 β₂=0.99, linear LR decay
└── Inference      temperature=0.5, ancestral sampling

4,192 params total. GPT-3 has 175,000,000,000. Same core algorithm. Different scale, depth, and engineering. The training loop, attention mechanism, and gradient descent are identical in principle.
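The 4,192 figure can be checked by adding up the matrix shapes. This sketch assumes four 16×16 attention projections (query, key, value, output), no bias vectors, and no learnable normalisation weights. The dictionary keys are hypothetical labels.

```python
vocab_size, n_embd, block_size, n_layer = 27, 16, 16, 1

shapes = {
    'tok_emb': (vocab_size, n_embd),
    'pos_emb': (block_size, n_embd),
    'lm_head': (vocab_size, n_embd),
}
for i in range(n_layer):
    for name in ('attn_q', 'attn_k', 'attn_v', 'attn_o'):
        shapes[f'layer{i}.{name}'] = (n_embd, n_embd)      # assumed 16x16 each
    shapes[f'layer{i}.ffn_up'] = (4 * n_embd, n_embd)      # 16 -> 64
    shapes[f'layer{i}.ffn_down'] = (n_embd, 4 * n_embd)    # 64 -> 16

total = sum(rows * cols for rows, cols in shapes.values())
print(total)  # 4192
```

Under those assumptions the embeddings and output head contribute 1,120 weights and the single layer contributes 3,072, matching the stated total.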