Understand how Claude, GPT, and Gemini actually work — from high school math to transformer architecture. No prerequisites. Every concept is interactive.
Everything in AI starts here. A function is a machine: input goes in, output comes out.
Hover the graph. Your mouse position = input x. The dot = output f(x).
Where the curve is steep, small input changes cause big output changes. Where it's flat, output barely moves.
The dashed crosshair tracks your position. The coordinates update in real-time below.
Claude, GPT-4, and Gemini are all massive functions: text goes in → text comes out. Training = finding the right function.
How fast is the output changing? That's what derivatives tell us — the foundation of how AI learns.
Nudge the input a tiny bit — how much does the output change? That ratio is the derivative.
Drag the yellow dot along the curve. The red line is the tangent — its angle IS the slope at that point.
For f(x) = x², the derivative is 2x. At x=3, slope=6 — the output changes 6× faster than the input.
Where slope = 0, the function is momentarily flat: a minimum, a maximum, or a level spot in between. AI training searches for minima of the error.
The derivative tells the training process: "nudging this parameter a little changes the error by about this much." It's the GPS of optimization.
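The "nudge the input" idea can be checked numerically. The sketch below is only an illustration: the helper name `numerical_derivative` and the step size `h` are assumptions, not part of the original page. It estimates the slope of f(x) = x² at x = 3 and compares it with the exact derivative 2x.

```python
def f(x):
    """The example function from the text: f(x) = x^2."""
    return x ** 2


def numerical_derivative(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x.

    Nudge the input by h in both directions and divide the change
    in output by the change in input.
    """
    return (func(x + h) - func(x - h)) / (2 * h)


slope_at_3 = numerical_derivative(f, 3.0)
print(round(slope_at_3, 4))  # approximately 6, matching 2x at x = 3
```

A smaller `h` gives a closer estimate up to the limits of floating-point precision.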
Derivatives tell us which direction to go. Now imagine a landscape of errors — walking downhill finds the best answer.
THE algorithm behind all AI training. Follow the slope downward to minimize error.
Yellow ball = the model's current parameters. Dark blue = low error (good). Red/orange = high error (bad).
Red arrow points uphill (gradient direction). The ball moves the opposite way — downhill toward lower error.
White trail shows the path taken. Each white dot is one step. Watch it spiral into the valley.
Set learning rate to 0.60 — watch it overshoot! Set momentum to 0.90 — it builds up speed like a heavy ball. Set Steps to 10 for batch updates.
Training a model like GPT-4 runs gradient descent over an enormous number of parameters (reportedly more than a trillion) across thousands of GPUs. It is the same algorithm you see here, at a vastly larger scale.
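Here is a minimal sketch of the ball rolling downhill, assuming a made-up bowl-shaped error surface `loss(x, y) = x² + 3y²` (the surface in the interactive demo is not specified, so this one is purely illustrative). The `learning_rate` and `momentum` arguments play the same roles as the sliders described above.

```python
def loss(x, y):
    """Hypothetical error surface: a bowl with its minimum at (0, 0)."""
    return x ** 2 + 3 * y ** 2


def gradient(x, y):
    """Partial derivatives of the loss; this vector points uphill."""
    return 2 * x, 6 * y


def gradient_descent(start, learning_rate=0.1, momentum=0.0, steps=50):
    """Follow the negative gradient and return the path of visited points."""
    x, y = start
    vx, vy = 0.0, 0.0  # velocity, only used when momentum > 0
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = gradient(x, y)
        vx = momentum * vx - learning_rate * gx
        vy = momentum * vy - learning_rate * gy
        x, y = x + vx, y + vy
        path.append((x, y))
    return path


path = gradient_descent(start=(4.0, -3.0))
print(loss(*path[0]), loss(*path[-1]))  # the loss shrinks toward 0

# A learning rate that is too large overshoots and the loss grows instead.
overshoot = gradient_descent(start=(4.0, -3.0), learning_rate=0.6, steps=10)
print(loss(*overshoot[-1]) > loss(*overshoot[0]))  # True on this surface
```

On this particular surface a learning rate of 0.6 diverges along the steeper y direction. The exact threshold depends on the curvature of the surface, so it will differ from the demo.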
You understand the math! Now: what is AI actually doing with all this?
The hierarchy from broad intelligence to the specific models you use every day.
Supervised learning: Show examples with correct answers. The model learns to map inputs to outputs. Like studying with flashcards.
Unsupervised learning: No answers given; the model finds structure on its own. It discovers clusters, patterns, and anomalies in raw data.
Reinforcement learning: Learn by trial and error. Take actions, receive rewards or penalties. Like training a dog with treats.
| Term | Plain English | In AI | Scale |
|---|---|---|---|
| Parameter | A knob you can turn | A learnable number in the model | GPT-4: reportedly ~1.8 trillion (unconfirmed) |
| Training | Practice makes perfect | Adjusting all parameters to reduce error | Months of GPU computation |
| Loss | How wrong you are | A number measuring prediction error | Lower = better model |
| Epoch | One study session | One pass through all training data | Small models train for many epochs; large language models often see most of their data about once |
| Inference | Using what you learned | Running the trained model on new data | When you chat with Claude |
| Batch size | How many flashcards per round | Examples processed before updating weights | Typically 256 to 4096 |
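
To make the vocabulary in the table concrete, here is a small sketch of a training loop. Everything in it is an illustrative assumption: a toy dataset following y = 2x + 1, a model with just two parameters (`w` and `b`), mean squared error as the loss, and hand-picked values for `learning_rate`, `batch_size`, and `epochs`.

```python
import random

random.seed(0)

# Toy training data that follows y = 2x + 1 exactly.
data = [(k / 10, 2 * (k / 10) + 1) for k in range(20)]

w, b = 0.0, 0.0        # the model's two parameters
learning_rate = 0.1
batch_size = 4         # examples processed before each update
epochs = 200           # full passes through the data


def mean_squared_error(w, b, examples):
    """Loss: average squared gap between prediction and truth."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


initial_loss = mean_squared_error(w, b, data)

for epoch in range(epochs):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradients of the batch loss with respect to each parameter.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mean_squared_error(w, b, data)
print(round(w, 2), round(b, 2))  # close to 2 and 1

# Inference: use the trained parameters on an input not seen in training.
print(round(w * 5.0 + b, 1))     # close to 11
```

Each inner-loop iteration is one weight update on one batch, and each outer-loop iteration is one epoch.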
Now let's see inside a neural network — the computational architecture that makes all of this possible.
Stack simple functions into layers. Each layer transforms data, learning increasingly abstract patterns.
Each circle = a neuron. It computes: output = activation(Σ(weight × input) + bias)
Green lines = positive weights (excitatory). When the input neuron fires, it encourages the connected neuron to fire too.
Red lines = negative weights (inhibitory). When the input fires, it suppresses the connected neuron. Thicker = stronger effect.
Hover any neuron to see its specific connections highlighted and its activation value.
A weight is a number on each connection that controls how much influence one neuron has on the next. Weight = 0.9 means "pass most of the signal through." Weight = -0.3 means "pass a little, inverted." Weight = 0 means "ignore completely." During training, the network adjusts every weight to reduce errors — that's what learning is.
Each neuron also has a bias: a number added after all the weighted inputs are summed. It shifts the neuron's activation threshold. Without bias, a neuron with all-zero inputs would always get a weighted sum of zero, no matter what it has learned. The bias lets it fire even with weak inputs, or stay quiet even with strong ones. Think of it as the neuron's "default mood."
output = activation(w₁×x₁ + w₂×x₂ + ... + wₙ×xₙ + bias)
Each input xᵢ is multiplied by its weight wᵢ, all products are summed, the bias is added, then an activation function squashes the result. This single formula — repeated billions of times across layers — is how Claude thinks.
3 input neurons receive pixel values [0.8, 0.2, 0.5]. Each value is multiplied by connection weights, summed at the next neuron, bias added, then activated. 2 output neurons produce probabilities like [0.9, 0.1] → "90% cat, 10% dog." Training adjusts every weight and bias until these outputs are correct.
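The worked example above can be sketched in a few lines. The pixel values come from the text; the weights and biases are hypothetical numbers picked for illustration, and softmax is assumed as the output activation because the outputs are described as probabilities.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def dense_layer(inputs, weights, biases):
    """For each neuron: sum(weight * input) + bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + bias
        for row, bias in zip(weights, biases)
    ]


pixels = [0.8, 0.2, 0.5]          # the 3 input values from the text

# Hypothetical learned values: one row of weights per output neuron.
weights = [
    [1.5, -0.5, 1.0],             # "cat" neuron
    [-1.0, 0.5, -0.5],            # "dog" neuron
]
biases = [0.1, -0.1]

scores = dense_layer(pixels, weights, biases)
probabilities = softmax(scores)
print([round(p, 2) for p in probabilities])  # cat gets most of the probability
```

Training would adjust `weights` and `biases` until the probabilities match the correct labels across many examples.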
Problem: stacking linear layers still gives a straight line. We need curves. That's what activation functions do.
Without these, a 100-layer network is no better than 1 layer. They give neural networks their power.
Without activations, multiplying matrices is always linear: y = Wx + b. Stacking 100 linear layers collapses to one. Activation functions add curves, letting the network model complex patterns.
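The collapse can be written out for two layers. The symbols W₁, b₁, W₂, b₂ are assumed names for the two layers' weights and biases:

```latex
\begin{aligned}
h &= W_1 x + b_1 \\
y &= W_2 h + b_2 \\
  &= W_2 (W_1 x + b_1) + b_2 \\
  &= (W_2 W_1)\,x + (W_2 b_1 + b_2)
\end{aligned}
```

The result has the same form as a single layer with weight W₂W₁ and bias W₂b₁ + b₂, and the same argument applies again for each further layer.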
ReLU: if negative → output 0. If positive → pass through. Simple, fast, and long the default choice in deep networks.
Sigmoid: squashes everything to 0–1. Used when you need probabilities. "How confident is the model?"
Tanh: outputs -1 to +1. Centered at zero, which helps training converge faster.
GELU: smooth version of ReLU. A common choice in transformer language models (used in BERT and GPT-2, for example).
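The four activations above are short enough to write out directly. This is a plain-Python sketch using the standard definitions; GELU is shown in its exact form based on the Gaussian error function, although many libraries use a faster approximation.

```python
import math


def relu(x):
    """Negative inputs become 0; positive inputs pass through."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any input into the range 0 to 1."""
    return 1 / (1 + math.exp(-x))


def tanh(x):
    """Squash any input into the range -1 to +1, centered at zero."""
    return math.tanh(x)


def gelu(x):
    """Smooth ReLU-like curve: x times the standard normal CDF of x."""
    return 0.5 * x * (1 + math.erf(x / math.sqrt(2)))


for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), round(sigmoid(value), 3),
          round(tanh(value), 3), round(gelu(value), 3))
```

Note how GELU lets a small negative value through for slightly negative inputs, where ReLU outputs exactly 0.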
A linear function can only draw flat decision boundaries: straight lines in 2D, flat planes in 3D, hyperplanes beyond that. Non-linear activations let the network bend and twist decision boundaries in any dimension, modeling complex patterns like language, images, and reasoning.
How does the network learn? How do we adjust billions of parameters? Enter backpropagation.
The chain rule traces errors backward through every layer, computing each parameter's contribution to the mistake.
Forward pass (green signals): data flows left→right through each layer, producing a prediction at the output.
Backward pass (red signals): error gradients flow right→left. Each weight learns: "how much did I contribute to the mistake?"
Chain rule: ∂Loss/∂w = ∂Loss/∂output × ∂output/∂w. Multiply local gradients layer by layer.
After backward: weight = weight − lr × gradient. Every weight adjusts proportionally. Repeat billions of times.
1. Forward: data flows through layers → prediction.
2. Loss: compare prediction to truth. How wrong?
3. Backward: the chain rule sends gradients back.
4. Update: adjust every weight by its gradient.
5. Repeat: billions of times across the data.
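The forward, loss, backward, and update steps above can be traced by hand for a single neuron. This sketch assumes a sigmoid activation and a squared-error loss, with made-up starting values. Real networks apply the same chain rule across many layers, usually through an automatic differentiation library.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def forward(w, b, x):
    """Forward pass: one neuron with a sigmoid activation."""
    return sigmoid(w * x + b)


def loss_fn(w, b, x, target):
    """Squared error between prediction and target."""
    return (forward(w, b, x) - target) ** 2


def backward(w, b, x, target):
    """Chain rule: dLoss/dw = dLoss/dout * dout/dz * dz/dw."""
    out = forward(w, b, x)
    dloss_dout = 2 * (out - target)
    dout_dz = out * (1 - out)      # derivative of the sigmoid
    dz_dw, dz_db = x, 1.0
    return dloss_dout * dout_dz * dz_dw, dloss_dout * dout_dz * dz_db


# Hypothetical starting point.
w, b, x, target, lr = 0.5, 0.0, 1.5, 1.0, 0.5

loss_before = loss_fn(w, b, x, target)
grad_w, grad_b = backward(w, b, x, target)

# Update rule from the text: weight = weight - lr * gradient
w, b = w - lr * grad_w, b - lr * grad_b
loss_after = loss_fn(w, b, x, target)

print(loss_after < loss_before)  # True: one step reduced the error
```

A useful sanity check for any hand-written gradient is to compare it with a finite-difference estimate of the same slope.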
Before understanding Transformers, we need to know how AI reads text. It doesn't understand words — only numbers.
Text → sub-word pieces → high-dimensional number vectors. This is how AI reads.
Tokenization: text is split into sub-word pieces. "understanding" → ["under", "stand", "ing"]. This lets models handle any word — even made-up ones.
Embedding: each token is mapped to a vector of ~1000+ numbers. Similar-meaning tokens end up close together in this "meaning space."
Position encoding: adds information about WHERE each token appears. "Dog bites man" ≠ "man bites dog" — word order matters!
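The three steps can be sketched end to end. Several things here are assumptions made for illustration: the tiny vocabulary, the greedy longest-match splitting rule (real tokenizers such as BPE learn their pieces from data), the random 4-number embeddings standing in for learned ones, and the sinusoidal position encoding from the original Transformer paper (many models use other schemes).

```python
import math
import random

# Hypothetical vocabulary, chosen so the example word splits as in the text.
VOCAB = ["under", "stand", "ing", "un", "der", "s", "t", "a", "n", "d", "i", "g"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}
EMBED_DIM = 4  # tiny for readability; real models use far more

# Stand-in for a learned embedding table: one random vector per token.
rng = random.Random(0)
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def tokenize(word):
    """Greedy longest-match split into known sub-word pieces."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in TOKEN_TO_ID:
                tokens.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no token covers {word[start]!r}")
    return tokens


def position_encoding(pos, dim=EMBED_DIM):
    """Sinusoidal encoding: sine on even indices, cosine on odd ones."""
    encoding = []
    for k in range(dim):
        angle = pos / (10000 ** ((k - k % 2) / dim))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


tokens = tokenize("understanding")
print(tokens)  # ['under', 'stand', 'ing']

# Model input = token embedding + position encoding, element by element.
model_inputs = [
    [e + p for e, p in zip(EMBEDDINGS[TOKEN_TO_ID[tok]], position_encoding(pos))]
    for pos, tok in enumerate(tokens)
]
print(len(model_inputs), len(model_inputs[0]))  # 3 vectors of 4 numbers
```

Because the position encoding differs at each position, the same token placed elsewhere in the sentence produces a different input vector.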
Context windows (figures vary by model version): Claude 200K tokens, GPT-4 Turbo 128K, Gemini up to 1M. One token ≈ ¾ of a word in English.
Tokens are now numbers. The breakthrough question: how should each token "look at" the others to understand context?
Instead of reading sequentially, attention lets every word look at every other word simultaneously.
Hover any word. The arcs show how much attention that word pays to every other word. Thicker = more attention.
The highlighted word is the "source." Percentages show attention weights — they always sum to 100%.
Notice: "cat" attends strongly to "curious" (its adjective) and "jumped" (its verb). The model learns which relationships matter from data.
In "The cat sat on the mat because it was tired" — what does "it" refer to? Attention lets the model look back and connect "it" to "cat" based on learned patterns. This is why transformers understand context so well.
Attention uses Queries, Keys, and Values — like searching a library. Let's build the full Transformer.
"Attention Is All You Need" (2017). The architecture behind Claude, GPT, and Gemini.
Query (Q): "What am I looking for?" — each token generates a query describing what context it needs.
Key (K): "What do I contain?" — each token generates a key describing its information.
Value (V): "Here's my content" — the actual information to retrieve when key matches query.
Score: the Q·K dot product measures similarity and is scaled down by the square root of the key dimension. Softmax normalizes the scores to probabilities. Multiply by V to get weighted information.
Q = your search query. K = labels on each book. V = the book content. Attention finds which books match your query, then gives you a weighted summary of their contents.
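The library lookup can be written as a small function. This is a sketch of scaled dot-product attention for a single head in plain Python, with made-up 2-number vectors. In a real model, Q, K, and V are produced from token vectors by learned weight matrices, which are omitted here.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query: score every key, softmax, then average the values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical "books": labels (keys) and contents (values).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0]]  # a search that matches the first label's direction

outputs, weights = attention(queries, keys, values)
print([round(w, 3) for w in weights[0]])  # books 1 and 3 score highest
print([round(o, 2) for o in outputs[0]])  # weighted summary of contents
```

The weights for each query sum to 1, which is why the demo's percentages always total 100%.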
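1. Tokenize: split text into tokens.
2. Embed: map tokens to vectors.
3. Attention: Q·K·V for each token.
4. Feed-forward: process the attended info.
5. Repeat: stack these layers deep.
6. Output: next-token probability.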
Scale figures: transformer layers in frontier models · attention heads per layer · tokens seen during training · learnable parameters.
The standard transformer is the foundation, but teams have developed key variations to improve efficiency and capability.
Dense: Every token passes through every parameter in every layer. Simple, powerful, but computationally expensive. All parameters are active for every input.
Mixture of Experts (MoE): Has many "expert" sub-networks, but a router picks only a few experts per token (2 of 8 in Mixtral). Total parameters are huge but compute per token is small. More efficient at scale.
State space models (SSM/Mamba): Process sequences through a compressed state instead of attending to all tokens. Much faster for very long sequences: compute grows linearly with sequence length, versus quadratically for attention.
| Architecture | Total params | Active per token | Strength | Trade-off |
|---|---|---|---|---|
| Dense | All | All (100%) | Maximum quality per parameter | Expensive to run — every param used every time |
| MoE | Very large (e.g. 600B) | Small subset (e.g. 40B) | Fast inference despite huge total size | Harder to train, needs careful routing |
| SSM/Mamba | Moderate | All | Very fast on long sequences | May lose fine-grained attention detail |
| Hybrid (SSM + Attention) | Varies | Varies | Best of both — speed + precision | Complex architecture design |
Imagine a hospital with 16 specialist doctors but each patient only sees 2 of them. A router (triage nurse) decides which specialists each patient needs. The hospital has the expertise of 16 doctors but the cost of running 2. That's MoE — a model like Mixtral 8x7B has 47B total parameters but only activates ~13B per token. DeepSeek-V3 uses 671B total but activates only ~37B. The result: frontier-level quality at a fraction of the compute cost.
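The triage-nurse idea can be sketched as top-k routing. Everything below is a toy assumption: the eight "experts" are simple scaling functions rather than neural networks, the router scores are hard-coded instead of learned, and the gate weights come from a softmax over the selected experts only, which is one common design.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k=2):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, gates))


# Hypothetical experts: expert i simply multiplies its input by (i + 1).
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]


def moe_layer(x, router_scores, k=2):
    """Run only the chosen experts and blend their outputs."""
    chosen = route_top_k(router_scores, k)
    output = sum(gate * experts[i](x) for i, gate in chosen)
    return output, [i for i, _ in chosen]


router_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
output, used = moe_layer(1.0, router_scores)
print(used)              # [1, 4]: only 2 of the 8 experts ran
print(round(output, 3))  # a blend of those two experts' outputs
```

With the figures quoted above, the active share per token works out to about 28% for Mixtral 8x7B (13B of 47B) and about 5.5% for DeepSeek-V3 (37B of 671B).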
Now that you understand the concepts, see a real GPT built line by line: How a GPT Works — Line by Line walks through Karpathy's 200-line microgpt.py with animated diagrams, showing exactly how tokens, attention, and backpropagation work in actual Python code.
Also explore: Transformer Explainer (live GPT-2 in browser) · LLM Visualization (3D walkthrough).
The model predicts a probability distribution over all possible next tokens. But how do we choose which token to actually output?
How the model chooses its next word. Temperature, top-k, and top-p control creativity vs. predictability.
The bars show the probability of each possible next word. The model has produced a distribution — now we must sample from it.
Temperature: controls randomness. Low (0.1) = very focused, picks the top word. High (2.0) = more random, creative, surprising.
Top-K: only consider the K most likely words. K=5 means ignore everything except the top 5 candidates.
Top-P (nucleus): keep words until cumulative probability reaches P. P=0.9 means keep the smallest set of words that covers 90% probability.
Code generation: low temperature (0.2), because you want precision. Creative writing: higher temperature (0.8–1.0). Brainstorming: high temperature (1.2+), where the API allows it. Allowed ranges differ by provider: Anthropic's API, for example, accepts 0 to 1 and defaults to 1.0.
1. Logits: the model outputs a raw score for every token in its vocabulary (~100K tokens).
2. Temperature: divide all scores by temperature. Low temp → sharp peaks. High temp → flatter distribution.
3. Softmax: convert scores to probabilities (0–1, sum to 100%). Higher score → higher probability.
4. Top-K: keep only the K most probable tokens. Discard the rest (probability → 0).
5. Top-P: keep the smallest set of tokens whose probabilities sum to ≥ P. Further reduces candidates.
6. Sample: randomly pick one token from the remaining candidates, weighted by probability. This becomes the next token.
The chosen token is appended to the sequence. The entire process runs again to pick the next token. This continues until the model outputs a stop token or reaches the maximum length. Generation speed is on the order of 50–100 tokens per second, varying with model size and server load.
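The sampling pipeline described above fits in one function. This is a sketch under stated assumptions: the four-entry `logits` list is made up, top-p is applied to the probabilities that survive top-k without renormalizing first, and the final draw uses Python's `random.choices`. Production implementations differ in such details.

```python
import math
import random


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Temperature -> softmax -> top-k -> top-p -> weighted random pick."""
    scaled = [score / temperature for score in logits]
    probs = softmax(scaled)

    # Candidate token ids, most probable first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]

    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary

print(sample_next_token(logits, top_k=1))            # always token 0
print(sample_next_token(logits, temperature=1.5,
                        top_k=3, top_p=0.9,
                        rng=random.Random(42)))      # one of tokens 0, 1, 2
```

Generating a full response means calling a function like this in a loop, appending each chosen token to the input before the next call.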
You now understand every piece. Final chapter: how Claude is built — and what happens when you press Enter.
Claude is a Transformer built by Anthropic. What makes it special isn't the architecture — it's how it's trained.
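Every concept you've learned, working together in a few seconds. The specific sizes below are illustrative; Anthropic has not published Claude's architecture.
1. You type: "What is gravity?"
2. Send: your text travels to Anthropic's servers.
3. Tokenize: "What" "is" "grav" "ity" "?" → 5 token IDs.
4. Embed: each token → a vector of ~4000 numbers.
5. Position: add position info, because word order matters.
6. Layers: each layer runs attention → feed-forward → normalize.
7. Attention: 128 heads find relevant context per token.
8. Feed-forward: billions of learned parameters transform the signal.
9. Output: the final layer outputs a score for every vocabulary token.
10. Sample: temperature + top-k + top-p → pick the next token.
11. Loop: append the token and run again, at roughly 50-100 tokens/sec.
12. Stream: tokens stream back to your screen as text.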
For a typical response of 200 words (~270 tokens), the model makes 270 passes through the full stack of transformer layers (96 in this illustration), one pass per generated token. Each pass involves billions of multiplications across attention heads and feed-forward networks. At 50-100 tokens per second, that takes roughly 3-5 seconds.
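The estimate can be reproduced from the rules of thumb quoted in this guide: about ¾ of a word per token and 50-100 tokens per second. The function name below is a hypothetical helper, and the speeds are the guide's rough figures, not measurements.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text


def generation_seconds(words, tokens_per_second):
    """Estimate tokens needed for a response and the time to generate them."""
    tokens = round(words / WORDS_PER_TOKEN)
    return tokens, tokens / tokens_per_second


tokens, slow = generation_seconds(200, tokens_per_second=50)
_, fast = generation_seconds(200, tokens_per_second=100)

print(tokens)          # 267, which the text rounds to ~270
print(round(slow, 1))  # 5.3 seconds at 50 tokens/sec
print(round(fast, 1))  # 2.7 seconds at 100 tokens/sec
```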
The model has no memory between conversations — your entire chat is re-read from scratch each time. It doesn't "think" — it predicts the next token, over and over. Yet from this simple loop, reasoning, creativity, and understanding emerge.
Pre-training: Learn language from trillions of words by predicting the next token, over and over. Like reading a vast library of books, articles, and websites, and building a massive internal model of how language works.
Reinforcement Learning from Human Feedback. Human raters compare pairs of responses: "Response A is better than Response B." The model learns to prefer helpful, accurate, well-structured answers.
Constitutional AI: Anthropic's approach. Instead of relying only on human raters, the model critiques and revises its own responses against a written set of principles, like having an internal critic.
When you chat with Claude: your prompt → tokens → through every transformer layer with attention → predict one token at a time → stream the response back to you.
After pre-training, the model can generate fluent text — but it might be unhelpful, offensive, or confidently wrong. It learned to predict text, not to be useful. RLHF (Reinforcement Learning from Human Feedback) teaches the model what humans actually want: helpful, honest, harmless responses.
1. The model generates multiple responses to the same prompt.
2. Human raters rank them: "Response A > Response B > Response C."
3. A reward model is trained on these rankings — it learns to score responses the way humans would.
4. The main model is fine-tuned using reinforcement learning to maximize the reward model's score — while staying close to its original behavior.
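Steps 2 and 3 can be made concrete with a small sketch. It assumes a pairwise (Bradley-Terry style) loss, which is a common way to train reward models from rankings. The text does not say which loss any particular lab uses, and the function names and scores here are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_best_first):
    """Turn 'A > B > C' into (better, worse) pairs: (A,B), (A,C), (B,C)."""
    return list(combinations(ranked_best_first, 2))


def preference_loss(reward_better, reward_worse):
    """-log(sigmoid(margin)): small when the preferred response scores higher."""
    margin = reward_better - reward_worse
    return math.log(1 + math.exp(-margin))


pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# Hypothetical reward-model scores for the three responses.
agrees_with_humans = {"A": 2.0, "B": 0.5, "C": -1.0}
disagrees_with_humans = {"A": -1.0, "B": 0.5, "C": 2.0}

for scores in (agrees_with_humans, disagrees_with_humans):
    total = sum(preference_loss(scores[better], scores[worse])
                for better, worse in pairs)
    print(round(total, 3))  # much lower when scores match the human ranking
```

Training the reward model means adjusting its parameters by gradient descent so that this loss goes down across many human-ranked examples.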
Without RLHF, asking "How do I make a cake?" might get a Wikipedia-style essay about the history of baking. With RLHF, you get a clear recipe. The model's knowledge doesn't change — RLHF changes how it presents that knowledge. It's the difference between a knowledgeable professor who lectures at you and one who actually answers your question. Claude uses RLHF plus Constitutional AI (self-critique against principles) — which scales better because it doesn't need human raters for every improvement.
| Component | What it does | Analogy |
|---|---|---|
| Tokenizer | Splits text into sub-word pieces | Breaking words into syllables |
| Embedding | Maps each token to a high-dimensional vector (~4000 numbers in this guide's illustration) | GPS coordinates in meaning-space |
| Multi-Head Attention | Many attention heads (128 in this guide's illustration) look at context simultaneously | 128 readers each highlighting different relationships |
| Feed-Forward Network | Processes the attended information through 2 dense layers | A "thinking" step — transform what was noticed |
| Layer Normalization | Stabilizes signal magnitude between layers | Keeping the volume level consistent |
| Residual Connection | Adds the input back to the output of each sub-layer | A shortcut — preserving the original signal |
| Softmax Output | Final probability distribution over ~100K vocabulary tokens | Ranking all possible next words by likelihood |
Tokens are the input and output. Sub-word pieces the model reads and writes. "Hello, how are you?" becomes ~6 tokens. The model generates tokens one at a time. Tokens flow through the model — they're the data, not the model itself.
Weights are the learnable numbers inside the model. Every connection between neurons has a weight. Every attention head has weight matrices (Q, K, V). These numbers were learned during training — they are the model's knowledge. When we say a model "knows" something, it's encoded in these weights.
Parameters = weights + biases. The total count of all learnable numbers. Every weight is a parameter. Every bias is a parameter. "Claude has billions of parameters" means billions of numbers adjusted during training.
Tokens flow through the network. At each layer, they're multiplied by weights (transformed). Total weights + biases = parameters. More parameters = more capacity to learn. Training adjusts all parameters to minimize loss. Inference passes tokens through fixed parameters.
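Counting parameters is simple arithmetic for a stack of fully connected layers: every connection contributes one weight and every neuron contributes one bias. The helper below is a hypothetical illustration. Transformer parameter counts also include attention and embedding matrices, which it does not model.

```python
def count_parameters(layer_sizes):
    """Weights + biases for fully connected layers of the given sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out   # one weight per connection
        biases = n_out           # one bias per neuron
        total += weights + biases
    return total


print(count_parameters([3, 2]))        # 8: 6 weights + 2 biases
print(count_parameters([3, 4, 2]))     # 26
print(count_parameters([4096, 4096]))  # 16781312 for one wide layer
```

A single 4096-to-4096 layer already holds almost 17 million parameters, which shows how quickly the totals reach billions.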
| Model size | Parameters | What it means | Analogy |
|---|---|---|---|
| 7B | 7 billion | A capable model that runs on a single GPU. Popular for local/open-source use. | A detailed city map |
| 13B | 13 billion | Nearly 2× capacity. Better at nuance and complex reasoning. | A regional atlas |
| 70B | 70 billion | Needs multiple GPUs. Significantly more capable at complex tasks. | A national encyclopedia |
| ~400B+ | Hundreds of billions | Frontier scale. Massive compute. State-of-the-art. (Exact sizes of models like Claude and GPT-4 are not public.) | The internet's knowledge, compressed |
A 7B model = 7 billion weights, each typically 2 bytes → ~14 GB of raw data. But parameter count alone doesn't determine quality — training data quality, training method (RLHF, Constitutional AI), and architecture choices matter enormously. A well-trained 7B can outperform a poorly trained 13B. This is why Anthropic invests in training methodology, not just scale.
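The memory figure follows directly from parameter count times bytes per parameter. The sketch below covers the weights only; it ignores activations and other runtime memory, and the precision names are common conventions rather than anything specified in the text.

```python
BYTES_PER_PARAMETER = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_gb(parameters_in_billions, precision="fp16"):
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = parameters_in_billions * 1e9 * BYTES_PER_PARAMETER[precision]
    return total_bytes / 1e9


for size in (7, 13, 70):
    print(f"{size}B: {model_size_gb(size):.0f} GB at fp16, "
          f"{model_size_gb(size, 'int8'):.0f} GB at int8")
```

At 2 bytes per weight this gives 14 GB for a 7B model, matching the figure above, and 140 GB for a 70B model, which is why larger models need multiple GPUs.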