microGPT — How a GPT Works

ACT 1 · SETUP & DATA

lines 1–6

🎲

Imports & Random Seed

import os, math, random
random.seed(42)  # same seed = same sequence every run

🙋 Non-Technical

The code borrows three helper tools from Python's built-in library, then sets a starting point for all "random" choices made later. By fixing this starting point to 42, every run of the program makes the exact same choices in the exact same order — so two people running it on different computers get identical results.

🎲 Same starting point = same sequence every time

Two friends agree to shuffle a deck of cards using the exact same starting arrangement and the same shuffle method. They'll always end up with identical decks, even if they're in different cities.

💻 Technical

Seeds Python's Mersenne Twister PRNG. Every downstream call — random.gauss (weight init), random.shuffle (data order), random.choices (inference sampling) — is fully deterministic from this point. Zero external dependencies: pure stdlib only, no numpy.
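A small sketch of what the seed buys you. The helper `draws` is a hypothetical name introduced here, not part of microgpt.py, and the 0.08 standard deviation is an assumed value. It reseeds the generator and makes one call to each of the three functions mentioned above.

```python
import random

def draws(seed):
    """Reseed the global PRNG, then make one call of each kind used later."""
    random.seed(seed)
    weight = random.gauss(0, 0.08)                          # stand-in for a weight-init draw
    order = list(range(10))
    random.shuffle(order)                                   # stand-in for shuffling the names
    pick = random.choices(range(27), weights=[1] * 27)[0]   # stand-in for sampling a letter
    return weight, order, pick

first = draws(42)
second = draws(42)
print(first == second)   # True: both runs agree draw for draw
```

Reseeding with the same value replays the same sequence, which is why the whole training run is repeatable.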

lines 8–14

📚

Loading & Shuffling 32,000 Names

docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)  # break alphabetical order → prevent bias

🙋 Non-Technical

The program reads roughly 32,000 real first names from a text file (emma, liam, sofia...) and then scrambles them into a random order. The scrambling step is important — if it studied names in alphabetical order, it would wrongly learn that names start with letters near the beginning of the alphabet more often than they really do.

🃏 Scrambling removes hidden ordering patterns

A teacher who only tests students on Monday mornings might conclude that students are always energetic and well-prepared. Testing on random days throughout the week reveals a more accurate picture.

💻 Technical

List comprehension with double-duty line.strip() — strips whitespace AND filters empty strings via Python truthiness in one pass. Training cycles via docs[step % len(docs)]. 1000 steps / 32K docs = <4% data coverage per run.
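The filtering and cycling described above can be tried without a file. This sketch swaps `open('input.txt')` for a small in-memory string (an assumption made so the example runs anywhere) and uses three sample names.

```python
import random

raw = "emma\n\n  liam  \nsofia\n\n"   # stand-in for the contents of input.txt

# strip() is called twice per line: once as the filter, once to produce the value
docs = [line.strip() for line in raw.splitlines() if line.strip()]
print(docs)                            # ['emma', 'liam', 'sofia']

random.seed(42)
random.shuffle(docs)

# Cycling: the index wraps around once step reaches len(docs)
visited = [docs[step % len(docs)] for step in range(5)]
print(visited[0] == visited[3])        # True: step 3 wraps back to docs[0]

# Coverage with the figures quoted in the text
coverage = 1000 / 32000
print(coverage)                        # 0.03125
```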

lines 16–20

🔢

Tokenizer — Letters Become Numbers

uchars = sorted(set(''.join(docs)))  # ['a','b',...,'z']
BOS = len(uchars)               # id=26, special start/end signal
vocab_size = len(uchars) + 1    # 27 total symbols

🙋 Non-Technical

Computers can only work with numbers — they can't read letters directly. So the first step is to create a simple lookup table that assigns every letter a number: a=0, b=1, c=2... all the way to z=25. There's also one extra special number (26) that acts like a 🔔 bell — it means "a name is starting" at the beginning, and "this name is finished" at the end.

🔤 → 🔢 Watch "emma" become a list of numbers

It's exactly like a simple spy code where A=0, B=1, C=2. The message "CAB" becomes "2-0-1". The computer reads the numbers, not the letters.

💻 Technical

Character-level tokenization. Encode: uchars.index(ch). Decode: uchars[id]. BOS is dual-role (SOS+EOS) — reduces vocab by 1 vs separate tokens. Training format: [BOS]+chars+[BOS]. No OOV tokens possible — closed vocabulary of 27.
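A minimal sketch of the encode/decode round trip. `encode` and `decode` are hypothetical helper names, and the vocabulary is assumed to be exactly the 26 lowercase letters, which is what the a=0 ... z=25 mapping above implies.

```python
import string

uchars = list(string.ascii_lowercase)   # assumed: the dataset contains exactly a-z
BOS = len(uchars)                       # 26, marks both the start and the end of a name
vocab_size = len(uchars) + 1            # 27

def encode(name):
    """Wrap the character ids in BOS on both sides: [BOS] + chars + [BOS]."""
    return [BOS] + [uchars.index(ch) for ch in name] + [BOS]

def decode(ids):
    """Drop BOS markers and map the remaining ids back to characters."""
    return ''.join(uchars[i] for i in ids if i != BOS)

tokens = encode("emma")
print(tokens)           # [26, 4, 12, 12, 0, 26]
print(decode(tokens))   # emma
```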

ACT 2 · HOW THE MODEL LEARNS FROM MISTAKES

lines 22–50

🔗

Numbers That Remember Where They Came From

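class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # the actual number
        self.grad = 0                    # how responsible is this number for the final error?
        self._children = children        # which numbers created this one?
        self._local_grads = local_grads  # local derivative w.r.t. each of those numbers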

a+b  → both parents receive the full upstream gradient (∂/∂a = 1, ∂/∂b = 1)
a*b  → each parent's blame depends on the other's value

🙋 Non-Technical

Every number in this program is a special kind of number that doesn't just hold a value — it also remembers how it was created. If a number was made by adding two other numbers, it remembers both of those "parent" numbers. This record-keeping makes it possible to trace any mistake all the way back to its root cause.

🌳 Every calculation builds a family tree

Imagine a detailed receipts system at a restaurant. Every dish on the bill traces back to ingredients, which trace back to suppliers. If there's a problem with a dish, you can follow the paper trail all the way back to the source.

💻 Technical

Define-by-run dynamic computation graph — DAG built implicitly during the forward pass. __slots__ avoids per-instance __dict__ overhead (~200→64 bytes/node). Local grads captured as Python floats at forward time — second-order AD unsupported. += handles fan-out correctly: accumulates gradients from all downstream uses, implementing the multivariate chain rule.

lines 52–65

⬅️

Tracing Mistakes Backward to Fix Them

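self.grad = 1              # ∂loss/∂loss = 1 by definition (loss is 100% responsible for itself)
for v in reversed(topo):   # walk backward through the family tree
    # _local_grads: assumed name for the local derivatives stored at forward time
    for child, local_grad in zip(v._children, v._local_grads):
        child.grad += local_grad * v.grad  # chain rule: upstream_grad × local_grad, accumulated into each input of v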

🙋 Non-Technical

Once we know how wrong the model was, we trace backward through the entire family tree of calculations to figure out which numbers were responsible — and by how much. Every single number that contributed to the wrong answer gets a "blame score." Numbers with high blame scores tend to get larger adjustments — though the exact amount also depends on how the fixing step is tuned.

🌊 Blame flows backward — each number gets its responsibility score

A relay race where the baton was dropped. The team investigates: the last runner dropped it because of a bad grip → the bad grip happened because the handoff angle was wrong → the angle was wrong because the previous runner threw it incorrectly. Each person gets a blame score based on how much they contributed to the final mistake.

💻 Technical

Post-order DFS builds topological sort O(V+E). Reversing gives reverse topological order — guarantees v.grad has all upstream contributions when processed. visited set prevents reprocessing shared nodes. += implements: ∂L/∂x = Σᵢ (∂L/∂yᵢ)(∂yᵢ/∂x) — multivariate chain rule for fan-out nodes.
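The pieces above fit together in a few dozen lines. What follows is a minimal sketch of a scalar autograd node supporting only `+` and `*`, written to illustrate the topological sort and the `+=` accumulation. The attribute name `_local_grads` and the two-operator scope are assumptions of this sketch, not a copy of the original class.

```python
class Value:
    __slots__ = ("data", "grad", "_children", "_local_grads")

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children          # inputs that produced this node
        self._local_grads = local_grads    # d(self)/d(input), stored as plain floats

    def __add__(self, other):
        # d(a+b)/da = 1 and d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b and d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()

        def build(v):
            # post-order DFS: a node is appended only after all of its inputs
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad   # accumulate across all uses

a, b = Value(2.0), Value(3.0)
y = a * b + a          # a is used twice, so its gradient has two contributions
y.backward()
print(y.data, a.grad, b.grad)   # 8.0 4.0 2.0
```

Here `a.grad` is 4.0 because the product contributes b = 3 and the direct addition contributes 1. With `=` in place of `+=`, one of those contributions would be overwritten.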

activation

⚡

The On/Off Switch That Makes the Model Smart

def relu(x): return max(0, x)
# positive numbers pass through unchanged; negative numbers become 0

🙋 Non-Technical

This is a simple gate: if a number coming in is positive, it passes through unchanged. If it's negative, it gets replaced with zero. That's it. But this tiny rule is what gives the model its power. Without it, adding more layers would be completely pointless — the whole thing would be mathematically identical to having just one layer, no matter how deep you went. The gate is what allows the model to learn patterns that a simple straight-line rule could never capture.

⚡ Drag the slider — watch the gate open and close

A light switch in a circuit. Ten dimmers (sliders) in a row is still just one dimmer — you can only control brightness. But add actual on/off switches between them and suddenly you can create far more complex patterns — like a specific sequence of flashes that couldn't be expressed with dimmers alone.

💻 Technical

ReLU instead of GeLU (noted deviation from GPT-2). Hard zero creates activation sparsity: ~50% of neurons inactive per forward pass, acting as implicit regularisation. Dead neuron problem: units receiving permanently negative input never receive gradient. GeLU (Hendrycks & Gimpel, 2016) ≈ E[Bernoulli gate × input] and is the activation used in GPT-2 and BERT. Choice here is pedagogical simplicity.
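The claim that stacked layers collapse without a gate can be checked with single-weight "layers". This is a toy sketch with hand-picked weights (2.0 and -0.5 are arbitrary), not the model's actual layers.

```python
def relu(x):
    return max(0.0, x)

def two_linear(x):
    return -0.5 * (2.0 * x)        # two linear layers in a row

def collapsed(x):
    return -1.0 * x                # one layer with the weights multiplied together

def with_relu(x):
    return -0.5 * relu(2.0 * x)    # same weights, gate in between

inputs = [-3.0, -1.0, 0.0, 2.0]
print(all(two_linear(x) == collapsed(x) for x in inputs))   # True: depth added nothing

# A single linear layer w*x always satisfies f(-x) == -f(x); the gated version does not
print(with_relu(3.0), with_relu(-3.0) == 0.0)               # -3.0 True
```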

ACT 3 · THE GPT BRAIN

lines 67–90

🗺️

Giving Every Letter a Rich Identity

wte[27×16]  # lookup table: each letter → 16-number description
wpe[16×16]  # position table: each position → 16-number description
x = letter_description + position_description

🙋 Non-Technical

The number 4 just means the letter "e" — it's a label with no meaning of its own. Before the model can work with it, the number 4 gets swapped for a list of 16 learned numbers that capture what "e" means as a letter — the same 16 numbers every time "e" appears. Later, a separate step figures out how the context around it matters. Think of it like upgrading from a name badge (just a label) to a full profile card (personality, history, relationships). The letter's position in the name also gets its own profile, and the two are combined.

🎴 Watch a single letter become a rich 16-number profile

Before training: 'e' is just the number 4 — meaningless. After training: in larger, well-trained models, similar letters tend to end up with similar profiles — vowels cluster near each other, for example. At this tiny scale (16 numbers, 1000 steps) the structure is more limited, but the same principle applies.

💻 Technical

wte = 27×16 learned lookup table. wpe = 16×16 learned absolute position embeddings (not fixed sinusoidal). Summed not concatenated — network must linearly disentangle identity from position in downstream layers. No weight tying wte↔lm_head (GPT-2 ties them, saving params and empirically helping at scale).
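As a rough illustration of "summed, not concatenated", the sketch below builds two randomly initialised tables and adds one row from each. The helper names `matrix` and `embed` and the 0.08 standard deviation are assumptions chosen for the example.

```python
import random

random.seed(42)
vocab_size, block_size, n_embd = 27, 16, 16

def matrix(nout, nin, std=0.08):
    """A nout x nin table of small random numbers (std is an assumed value)."""
    return [[random.gauss(0, std) for _ in range(nin)] for _ in range(nout)]

wte = matrix(vocab_size, n_embd)   # one row per token
wpe = matrix(block_size, n_embd)   # one row per position

def embed(token_id, pos_id):
    """Element-wise sum of the token row and the position row."""
    return [t + p for t, p in zip(wte[token_id], wpe[pos_id])]

x = embed(4, 1)                    # the letter 'e' (id 4) at position 1
print(len(x))                      # 16
print(embed(4, 0) == embed(4, 1))  # False: same letter, different position
```

Because the two rows are added, the result stays 16 numbers wide; concatenation would double the width of every downstream layer.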

lines 95–128

🧠

Attention — The Model Reads Its Own Past

q = "what am I looking for from past letters?"
k = "what does each past letter advertise?"
v = "what is each past letter's actual content?"
attention = how well q matches each k → weighted mix of v

🙋 Non-Technical

When the model is about to predict the next letter in "brend___", it doesn't just look at the last letter — it scans back through all the previous letters and decides which ones are most relevant to the decision it's about to make. It shines a spotlight on the most useful past letters and pays less attention to the less relevant ones. The diagram below shows this spotlight moving through a name in real time.

🔦 Watch the attention spotlight move across letters

Like a chef mid-recipe who glances back at the recipe card. They don't re-read the entire recipe — they focus on the step that's most relevant right now. The model learned during training which patterns in past letters tend to be useful at each point — it encodes this as weight matrices, not rules anyone wrote by hand.

Why 4 attention heads? The model runs this spotlight process 4 times in parallel, and each run may pick up on a different kind of pattern: one might respond to how a name starts, another to patterns near the end. All four results get combined.

📊 Attention scores as a grid — darker = more attention paid

💻 Technical

Causal self-attention with growing KV cache. Causality enforced structurally (sequential processing) rather than via attention mask. Scaling by 1/√head_dim prevents softmax saturation. 4 heads split the 16-dim space into 4×4 subspaces, outputs concatenated and projected through attn_wo. KV cache is live in the computation graph during training — gradients flow through cached keys/values, unlike inference-only KV caches.
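A single-head sketch of how a growing cache enforces causality without a mask. To keep it short, the query, key, and value at each step are all the same toy vector; in the real model they come from three separate learned projections. `attend` and `softmax` are helper names introduced for this example.

```python
import math

def softmax(xs):
    m = max(xs)                                   # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over everything cached so far."""
    head_dim = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim) for k in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(head_dim)]
    return out, weights

# Toy inputs, one vector per position (head_dim = 4)
steps = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]

keys, values, history = [], [], []
for x in steps:
    keys.append(x)       # the cache grows by one entry per position
    values.append(x)
    out, weights = attend(x, keys, values)
    history.append(weights)

print([len(w) for w in history])   # [1, 2, 3]
```

Position t can only attend to t+1 cached entries because later keys have not been appended yet, which is the structural causality described above.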

lines 129–142

🔀

Thinking Deeper — and Keeping the Original

original = x                 # save a copy before processing
x = expand(x)  # zoom in: 16 numbers → 64 numbers (more room to think)
x = gate(x)    # apply on/off switches
x = compress(x)# zoom back out: 64 → 16
x = x + original  # ← add the original back in (the skip connection)

🙋 Non-Technical

After the attention step reads the past, there's a "thinking" step. The model temporarily expands from 16 numbers to 64 — giving itself more working space to process what it just read — then compresses back down to 16. But crucially, it also adds back the original 16 numbers it started with before the thinking step. This "add the original back" trick turns out to be essential — without it, information gets scrambled and distorted as it flows through the layers, and the model struggles to learn anything useful.

🛣️ The original is added back after each transformation

You're editing a document. You make a copy first, then heavily revise the working draft. At the end, you merge your revisions back with the original. The layer only needs to write what it wants to change — the original is preserved by addition. This is why deep networks can be trained at all: gradients can flow backward through the addition directly to earlier layers without vanishing.

💻 Technical

Pre-norm architecture (RMSNorm before each sub-layer), as in LLaMA/Mistral; pre-norm generally trains more stably than the post-norm layout of the original Transformer and GPT-1 (GPT-2 is itself pre-norm, but with LayerNorm). A 4× FFN expansion is a widely used heuristic (GPT-2, PaLM); LLaMA uses roughly 8/3× because its gated FFN has a third weight matrix. ReLU sparsity ~50% acts as implicit regularisation. The residual connection provides an O(1) gradient path to any layer, mitigating vanishing gradients in deep networks (He et al., 2015).
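The expand, gate, compress, add-back sequence can be sketched with plain floats. The helper names, the random initialisation, and the epsilon of 1e-5 are assumptions of this example; the RMSNorm here has no learned gain.

```python
import math
import random

random.seed(0)

def matrix(nout, nin, std=0.08):
    return [[random.gauss(0, std) for _ in range(nin)] for _ in range(nout)]

def linear(x, w):
    """Matrix-vector product: one output per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rmsnorm(x, eps=1e-5):
    """Rescale x so that its mean square is approximately 1."""
    ms = sum(xi * xi for xi in x) / len(x)
    scale = 1.0 / math.sqrt(ms + eps)
    return [xi * scale for xi in x]

def mlp_block(x, fc1, fc2):
    residual = x                          # keep the original
    h = rmsnorm(x)                        # pre-norm
    h = linear(h, fc1)                    # expand: 16 -> 64
    h = [max(0.0, hi) for hi in h]        # ReLU gate
    h = linear(h, fc2)                    # compress: 64 -> 16
    return [hi + ri for hi, ri in zip(h, residual)]   # add the original back

x = [random.gauss(0, 1) for _ in range(16)]
fc1, fc2 = matrix(64, 16), matrix(16, 64)
y = mlp_block(x, fc1, fc2)
print(len(y))                             # 16

# If the block contributes nothing, the input passes through untouched
zero_fc2 = [[0.0] * 64 for _ in range(16)]
print(mlp_block(x, fc1, zero_fc2) == x)   # True
```

The last check is the point of the skip connection: the block only has to learn a correction, and the identity path is always available.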

ACT 4 · TRAINING — GETTING BETTER STEP BY STEP

lines 148–165

📉

Measuring How Wrong the Model Is

scores = model_output  # 27 scores, one per possible next letter
probs  = scores_to_percentages(scores)
penalty = -log( probs[correct_letter] )
# Very confident about right answer → tiny penalty
# Totally clueless → huge penalty

🙋 Non-Technical

After the model makes a prediction, we need to measure how wrong it was. The model outputs a confidence score for all 27 possible next letters. We look at how much confidence it gave to the correct letter. If it was very confident and correct, the penalty is tiny. If it gave the correct letter almost zero confidence, the penalty is huge. We want this penalty number to get as small as possible over time — that's the whole goal of training.

🎯 Drag to see how confidence changes the penalty

Like a quiz where you have to bet on your answer. If you put 99% of your confidence on the right answer, you barely lose any points. If you put only 1% on what turns out to be the right answer, you lose a huge number of points. This forces the model to not just guess right, but to be confidently right.

💻 Technical

NLL loss with teacher forcing — model receives true previous token during training (not its own prediction). Average by n normalises for variable-length sequences. Softmax converts logits to a probability simplex; cross-entropy -log(p_target) measures KL divergence from the one-hot target. The computation graph spans the entire forward pass — backward() propagates through softmax, lm_head, all transformer layers, and into the embedding tables.
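A worked sketch of the penalty for a single position, using plain floats. The logit values are made up for illustration; `nll` is a helper name introduced here.

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract the max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, target):
    """Negative log-likelihood of the correct token."""
    return -math.log(softmax(logits)[target])

clueless = [0.0] * 27          # every letter equally likely
confident = [0.0] * 27
confident[4] = 10.0            # strongly favours token 4 ('e')

print(round(nll(clueless, 4), 4))    # 3.2958, which is log(27)
print(round(nll(confident, 4), 4))   # 0.0012, confident and right
print(round(nll(confident, 0), 4))   # 10.0012, confident and wrong
```

An untrained model should start near log(27) ≈ 3.30 per position; the per-name loss is the mean of these values over the n predicted positions.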

lines 168–185

⚙️

Adam — The Smart Way to Fix Mistakes

m[i] = 0.85*m[i] + 0.15*gradient   # running average of recent gradients (momentum)
v[i] = 0.99*v[i] + 0.01*gradient²  # running average of squared gradients (tracks gradient size)
weight -= step_size * direction / wildness  # step scaled by momentum / sqrt(variance) — larger for consistent gradients
weight.gradient = 0  # reset for next round

🙋 Non-Technical

After measuring how wrong the model was and tracing which numbers caused it, we need to actually fix those numbers. The simplest approach would be to nudge everything by a fixed amount in the right direction. But Adam is smarter — it remembers recent corrections and adapts: bolder when consistently heading the same way, more cautious when bouncing around.

Important caveat: neither approach can see the full landscape. Both only feel the slope right under their feet, one tiny step at a time. There could be a lower point just over a hill they never cross. This is why training again with a different random seed (anything other than the 42 used here) can give slightly different results, and why nobody can guarantee the model found the absolute best version of itself. In practice you run it with several seeds and keep the best.

🏔️ Finding a low point — not necessarily the lowest

Two hikers trying to find the lowest point in a foggy valley — neither can see the full terrain, only the slope right under their feet. The beginner takes the same size step every time and keeps overshooting. The experienced hiker (Adam) takes bigger steps on clear downhills and tiny careful steps near flat spots. Adam usually settles into a low point faster — but neither hiker can be sure they haven't missed a deeper valley hidden behind a hill they never crossed.

💻 Technical

Standard Adam (Kingma & Ba, 2015) with bias correction. β₁=0.85 (vs the standard 0.9) gives a shorter momentum memory, roughly 1/(1-β₁) ≈ 7 steps instead of 10. β₂=0.99 smooths over ~100 steps. Bias correction m̂=m/(1-β₁ᵗ) compensates for the zero-initialised averages in early steps. Linear LR decay from 0.01→0 over 1000 steps. No weight decay, i.e. plain Adam rather than AdamW; extra regularisation is not needed at ~4K params. p.grad=0 is a manual zero-grad and must happen before the next backward(), because gradients accumulate with +=.
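One Adam update for a single scalar parameter, as a sketch. The function names, the epsilon of 1e-8, and the exact form of the decay schedule are assumptions; β₁, β₂, and the 0.01 starting rate are the values quoted above.

```python
import math

def lr_at(step, num_steps=1000, base_lr=0.01):
    """Linear decay from base_lr at step 0 to zero at num_steps (assumed form)."""
    return base_lr * (1 - step / num_steps)

def adam_step(p, g, m, v, t, lr, beta1=0.85, beta2=0.99, eps=1e-8):
    """Update parameter p given gradient g; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = adam_step(p=1.0, g=4.0, m=0.0, v=0.0, t=1, lr=lr_at(0))
print(round(p, 6))   # 0.99
```

On the very first step the bias-corrected ratio is g/|g|, so the parameter moves by almost exactly the learning rate regardless of how large the gradient is. That is the sense in which Adam normalises step size.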

lines 148–186

🔄

The Training Loop — 1000 Rounds of Practice

for step in range(1000):
    name   = pick_one_name()          # e.g. "emma"
    error  = predict_and_measure(name)# how wrong were we?
    blame  = trace_backward(error)    # who caused it?
    nudge  = adam_update(blame)       # fix the responsible numbers

🙋 Non-Technical

Training is just this 4-step loop repeated 1000 times. Each round, the model studies one name, measures how badly it predicted each next letter, works out how much each of its 4,192 internal numbers contributed to the mistake, and nudges all of them slightly in the right direction. After 1000 rounds of this, those 4,192 numbers have been gradually nudged toward capturing patterns in real names. At this tiny scale the results are rough — the model generates plausible-looking names, not polished ones — but the same process — at vastly larger scale — is the foundation of every AI you've used. (ChatGPT, Claude, and Gemini have extra layers of training on top of this to make them helpful assistants, but they all start from exactly this loop.)

📊 The penalty score drops as the model learns

Learning to shoot free throws. Each attempt: take the shot (predict), see where it went (measure error), think about what you did wrong (trace blame), adjust your form slightly (nudge). After 1000 shots, your form has been incrementally refined by every attempt. The model is the same — each of the 4,192 numbers has been nudged 1000 times, once per round.

💻 Technical

Online learning, batch=1. Computation graph builds across the entire forward pass before backward() is called. O(T×d²) scalar ops per document, ~4 orders of magnitude slower than a PyTorch equivalent. KV cache reset each step. With step % len(docs) cycling, each document is seen at most once across 1000 steps (about 3% data coverage, i.e. 1000 of 32K names). No gradient clipping, no regularisation; capacity limitation dominates at this scale.

ACT 5 · GENERATING NEW NAMES

lines 188–205

✨

Temperature — Controlling Creative vs Safe

probs = adjust_confidence(model_output, temperature)
next_letter = randomly_pick(probs)  # weighted by confidence
if next_letter == STOP_SIGNAL: break  # name is complete

🙋 Non-Technical

After training, the model generates names one letter at a time. It starts with a "begin" signal, outputs a confidence score for every possible next letter, picks one (weighted by confidence), then feeds that letter back in as the starting point for the next step — repeating until it outputs a "done" signal. The temperature setting controls how bold or cautious the picks are.

🌡️ Drag the slider — watch how temperature changes the choices

Temperature	Behaviour	Names you get
Low (0.1)	Almost always picks the most likely letter	Safe but boring, often repetitive
Medium (0.5) ← used here	Mostly picks likely letters, occasionally adventurous	Varied but still plausible
High (2.0)	Almost random picks	Creative but often nonsense

💻 Technical

Temperature T rescales logits before softmax: P_T(x) ∝ exp(logit/T). Dividing before softmax is equivalent to raising the T=1 probabilities to the power 1/T and renormalising. As T→0 this approaches argmax (greedy decoding). Same code path as training: the KV cache accumulates identically, with no separate inference mode. No top-k or nucleus sampling, which is adequate for short sequences (length ≤15) with a small vocab.
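A short sketch of temperature scaling and its equivalence to a power of the probabilities. The three logits are made-up values, and `temperature_probs` is a helper name introduced for the example.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_probs(logits, T):
    """Divide the logits by T, then apply softmax."""
    return softmax([l / T for l in logits])

logits = [2.0, 1.0, 0.1]
T = 0.5

# Route 1: rescale logits. Route 2: raise T=1 probabilities to 1/T and renormalise.
route1 = temperature_probs(logits, T)
powered = [p ** (1 / T) for p in softmax(logits)]
route2 = [p / sum(powered) for p in powered]
print(all(abs(a - b) < 1e-12 for a, b in zip(route1, route2)))   # True

# Lower temperature concentrates probability on the top choice
print(max(temperature_probs(logits, 0.1)) > max(temperature_probs(logits, 2.0)))   # True

# Sampling one token id, weighted by the adjusted probabilities
random.seed(42)
next_id = random.choices(range(len(logits)), weights=route1)[0]
```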

THE BIG PICTURE

summary

🗺️

The Full Picture — End to End

🙋 Non-Technical

🏭 One signal flowing through all stages

Step	What happens	Think of it as...
Letters → Numbers	Convert text to something the computer can work with	Morse code
Give letters identity	Replace each number with a rich description	Name badge → full profile
Read the past	Decide which previous letters matter most right now	Spotlight on relevant history
Think	Process what was just read	The "digest and decide" step
Measure the mistake	How wrong was the prediction?	Teacher's red pen
Trace the blame	Which numbers caused the mistake?	Audit trail
Fix the numbers	Nudge everything slightly toward being less wrong	GPS rerouting after a wrong turn

The key insight: "Everything else is just efficiency." — Andrej Karpathy.

Every AI you've used (ChatGPT, Claude, Gemini) is built on this same core idea: predict the next token (here a single character; in large models usually a word fragment), measure how wrong it was, trace which numbers caused it, nudge them to be slightly less wrong. The architectures and engineering have grown far more complex, but the fundamental loop is the same. The most visible difference is scale: hundreds of billions of numbers instead of 4,192.

💻 Technical

microgpt.py — 200 lines, zero dependencies
├── Tokenizer       char→int, BOS=26, vocab=27
├── Autograd        scalar DAG, reverse-mode AD
├── Architecture
│   ├── tok_emb[27×16] + pos_emb[16×16]
│   └── × 1 layer:
│       ├── Pre-norm (RMSNorm)
│       ├── Multi-head causal self-attn (4 heads, KV cache)
│       ├── Residual add
│       ├── Pre-norm + FFN: 16→64→ReLU→16
│       └── Residual add → lm_head[27×16]
├── Training        cross-entropy, teacher forcing, 1000 steps
├── Optimizer       Adam β₁=0.85 β₂=0.99, linear LR decay
└── Inference       temperature=0.5, ancestral sampling

4,192 params total. GPT-3 has 175,000,000,000.
Same core algorithm. Different scale, depth, and engineering.
The training loop, attention mechanism, and gradient descent are identical in principle.
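The 4,192 figure can be reproduced from the shapes in the tree above. This tally assumes four 16×16 attention matrices, two FFN matrices, no bias terms, and no learned norm gains; the key names other than wte, wpe, lm_head, and attn_wo are made up for the example.

```python
vocab_size, block_size, n_embd, n_layer = 27, 16, 16, 1

shapes = {
    "wte": (vocab_size, n_embd),       # token embeddings
    "wpe": (block_size, n_embd),       # position embeddings
    "lm_head": (vocab_size, n_embd),   # output projection (not tied to wte)
}
for i in range(n_layer):
    for name in ("attn_wq", "attn_wk", "attn_wv", "attn_wo"):
        shapes[f"layer{i}.{name}"] = (n_embd, n_embd)
    shapes[f"layer{i}.mlp_fc1"] = (4 * n_embd, n_embd)   # expand 16 -> 64
    shapes[f"layer{i}.mlp_fc2"] = (n_embd, 4 * n_embd)   # compress 64 -> 16

total = sum(rows * cols for rows, cols in shapes.values())
print(total)   # 4192
```

That is 432 + 256 + 432 for the three tables, 1,024 for attention, and 2,048 for the FFN.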
