From Zero to AI — An Interactive Journey

01

Functions

Everything in AI starts here. A function is a machine: input goes in, output comes out.

Hover to explore the function

What you're seeing

Hover over the graph. Your mouse position is the input x; the dot marks the output f(x).

Where the curve is steep, small input changes cause big output changes. Where it's flat, output barely moves.

The dashed crosshair tracks your position. The coordinates update in real time below.

🌍 Real World

Claude, GPT-4, and Gemini are all massive functions: text goes in → text comes out. Training = finding the right function.

02

Slopes & Derivatives

Nudge the input a tiny bit — how much does the output change? That ratio is the derivative.

Drag the yellow dot along the curve

What you're seeing

Drag the yellow dot along the curve. The red line is the tangent — its angle IS the slope at that point.

For f(x) = x², the derivative is 2x. At x=3, slope=6 — the output changes 6× faster than the input.

Where slope = 0, the curve is momentarily flat: usually a minimum or a maximum (occasionally a flat spot on the way up or down). AI uses this to find optimal answers.
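The "nudge the input" idea can be checked in a few lines. This is a minimal sketch, not part of the interactive demo: the helper name numerical_derivative and the step size h are assumptions chosen for illustration.

```python
def f(x):
    """The example function from the text: f(x) = x squared."""
    return x ** 2


def numerical_derivative(func, x, h=1e-6):
    """Estimate the slope at x by nudging the input a tiny amount h
    in both directions and dividing the output change by the input change."""
    return (func(x + h) - func(x - h)) / (2 * h)


slope_at_3 = numerical_derivative(f, 3.0)
slope_at_0 = numerical_derivative(f, 0.0)

print(f"slope at x=3: {slope_at_3:.4f}")  # close to 2 * 3
print(f"slope at x=0: {slope_at_0:.4f}")  # close to 0, the bottom of the curve
```

The estimate at x=3 agrees with the formula 2x, and the slope at x=0 is zero, which is where x² has its minimum.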

🌍 Why AI cares

The derivative tells the training process: "a small adjustment to this parameter will change the error by about this much." It's the GPS of optimization.

03

Gradient Descent

THE algorithm behind nearly all modern neural network training. Follow the slope downward to minimize error.

What you're seeing

Yellow ball = the model's current parameters. Dark blue = low error (good). Red/orange = high error (bad).

Red arrow points uphill (gradient direction). The ball moves the opposite way — downhill toward lower error.

White trail shows the path taken. Each white dot is one step. Watch it spiral into the valley.

Press Play or Step to begin

Learning rate 0.05

Momentum 0.10

Steps 1

💡 Try this

Set learning rate to 0.60 — watch it overshoot! Set momentum to 0.90 — it builds up speed like a heavy ball. Set Steps to 10 for batch updates.
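The sliders above map onto a short update loop. The sketch below is an assumption-laden stand-in for the demo: the bowl-shaped loss x² + 3y², the starting point, and the function names are all hypothetical, chosen only so the overshoot is easy to see.

```python
import numpy as np


def loss(p):
    """Hypothetical error surface: a bowl that is steeper along y than x."""
    x, y = p
    return x ** 2 + 3 * y ** 2


def grad(p):
    """Gradient of the loss: points uphill."""
    x, y = p
    return np.array([2 * x, 6 * y])


def descend(start, lr=0.05, momentum=0.10, steps=100):
    """Gradient descent with momentum. Returns every position visited."""
    p = np.array(start, dtype=float)
    velocity = np.zeros_like(p)
    path = [p.copy()]
    for _ in range(steps):
        # Keep a fraction of the previous step, then move against the gradient.
        velocity = momentum * velocity - lr * grad(p)
        p = p + velocity
        path.append(p.copy())
    return np.array(path)


good = descend([2.0, 1.5], lr=0.05, momentum=0.10)
bad = descend([2.0, 1.5], lr=0.60, momentum=0.0, steps=20)

print("lr=0.05 final loss:", loss(good[-1]))
print("lr=0.60 final loss:", loss(bad[-1]))
```

On this surface a learning rate of 0.05 settles into the valley, while 0.60 makes each step along y larger than the last, so the error grows instead of shrinking.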

🌍 Scale

Training a frontier model such as GPT-4 runs gradient descent over an enormous number of parameters (unconfirmed reports put GPT-4 in the trillions) across thousands of GPUs. It is the same algorithm you see here, just at a vastly larger scale.

04

AI, ML & Deep Learning

The hierarchy from broad intelligence to the specific models you use every day.

AI (Artificial Intelligence): Any system that exhibits intelligent behavior

ML (Machine Learning): Systems that learn from data

DL (Deep Learning): Multi-layer neural networks

LLMs: Claude · GPT · Gemini

Three ways machines learn

🎓

Supervised Learning

🖼️ image → model → "cat"

Show examples with correct answers. The model learns to map inputs to outputs. Like studying with flashcards.

Claude pre-training · GPT-4 · Gemini

🔍

Unsupervised Learning

No answers given — find structure on its own. Discovers clusters, patterns, and anomalies in raw data.

Clustering · Anomaly detection

🎮

Reinforcement Learning

action → environment → +reward

Learn by trial and error. Take actions, receive rewards or penalties. Like training a dog with treats.

Claude RLHF · AlphaGo · Robotics

Essential vocabulary

Term	Plain English	In AI	Scale
Parameter	A knob you can turn	A learnable number in the model	GPT-4: reportedly ~1.8 trillion (unconfirmed)
Training	Practice makes perfect	Adjusting all parameters to reduce error	Months of GPU computation
Loss	How wrong you are	A number measuring prediction error	Lower = better model
Epoch	One study session	One pass through all training data	Small models train for many epochs; large language models often make roughly one pass
Inference	Using what you learned	Running the trained model on new data	When you chat with Claude
Batch size	How many flashcards per round	Examples processed before updating weights	Typically 256 to 4096

05

Neural Networks

Stack simple functions into layers. Each layer transforms data, learning increasingly abstract patterns.

Hover neurons to explore connections

What you're seeing

Each circle = a neuron. It computes: output = activation(Σ(weight × input) + bias)

Green lines = positive weights (excitatory). When the input neuron fires, it encourages the connected neuron to fire too.

Red lines = negative weights (inhibitory). When the input fires, it suppresses the connected neuron. Thicker = stronger effect.

Hover any neuron to see its specific connections highlighted and its activation value.

⚖️ Weights — what are they really?

A weight is a number on each connection that controls how much influence one neuron has on the next. Weight = 0.9 means "pass most of the signal through." Weight = -0.3 means "pass a little, inverted." Weight = 0 means "ignore completely." During training, the network adjusts every weight to reduce errors — that's what learning is.

➕ Biases — the offset

Each neuron also has a bias: a number added after all the weighted inputs are summed. It shifts the neuron's activation threshold. Without bias, a neuron with all-zero inputs would always produce the same fixed output, activation(0), which is zero for ReLU and tanh. The bias lets it fire even with weak inputs, or stay quiet even with strong ones. Think of it as the neuron's "default mood."

🔢 The neuron formula

output = activation(w₁×x₁ + w₂×x₂ + ... + wₙ×xₙ + bias)

Each input xᵢ is multiplied by its weight wᵢ, all products are summed, the bias is added, then an activation function squashes the result. This single formula — repeated billions of times across layers — is how Claude thinks.

📊 Data flow example

3 input neurons receive pixel values [0.8, 0.2, 0.5]. Each value is multiplied by connection weights, summed at the next neuron, bias added, then activated. 2 output neurons produce probabilities like [0.9, 0.1] → "90% cat, 10% dog." Training adjusts every weight and bias until these outputs are correct.
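Here is the data flow example as a small sketch. The pixel values come from the text; the weights and biases are made-up numbers (a trained network would have learned them), and softmax is assumed as the output activation because the example produces probabilities.

```python
import numpy as np


def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())  # subtracting the max avoids overflow
    return e / e.sum()


# Inputs from the text: three pixel values.
x = np.array([0.8, 0.2, 0.5])

# Hypothetical weights: one row per output neuron, one column per input.
W = np.array([[1.5, -0.5, 2.0],    # "cat" neuron
              [-1.0, 0.5, -0.5]])  # "dog" neuron
b = np.array([0.1, -0.1])          # hypothetical biases

# The neuron formula: weighted sum plus bias, then activation.
z = W @ x + b
probs = softmax(z)

print("weighted sums:", z)
print("probabilities [cat, dog]:", probs)
```

With these particular numbers the "cat" neuron wins by a wide margin. Training would adjust W and b until outputs like this match the labels.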

06

Activation Functions

Without these, a 100-layer network is no better than 1 layer. They give neural networks their power.

⚡ Compare activation functions

Why we need activation functions

Without activations, multiplying matrices is always linear: y = Wx + b. Stacking 100 linear layers collapses to one. Activation functions add curves, letting the network model complex patterns.

ReLU: if negative → output 0. If positive → pass through. Simple, fast, and a long-standing default in deep networks.

Sigmoid: squashes everything to 0–1. Used when you need probabilities. "How confident is the model?"

Tanh: outputs -1 to +1. Centered at zero, which helps training converge faster.

GELU: smooth version of ReLU. Common in transformers (GPT-2 and BERT use it); many newer language models use gated relatives such as SwiGLU.

ReLU: max(0, x) — the most popular
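The four functions above fit in a few lines each. This is a plain-Python sketch for single numbers; the GELU shown is the exact form based on the Gaussian error function, and real libraries often use a faster approximation.

```python
import math


def relu(x):
    """Negative → 0, positive → unchanged."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash any number into the range -1 to +1, centered at zero."""
    return math.tanh(x)


def gelu(x):
    """Smooth relative of ReLU: x times the normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  sigmoid={sigmoid(x):.3f}  "
          f"tanh={tanh(x):.3f}  gelu={gelu(x):.3f}")
```

Notice that GELU lets a small negative value through where ReLU outputs exactly zero. That smoothness is the main difference between the two.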

📐 Linear vs non-linear

A linear function can only draw straight lines or flat planes (hyperplanes in higher dimensions) as decision boundaries. Non-linear activations let the network bend and twist decision boundaries in any dimension, modeling complex patterns like language, images, and reasoning.
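The claim that stacked linear layers collapse into one can be verified numerically. The sketch below uses random weights and made-up layer sizes (3 → 4 → 2); nothing here comes from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with random (hypothetical) weights and biases.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Pass the input through both layers with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# Fold the two layers into a single equivalent layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

# Same two layers, but with a ReLU in between: the folding trick no longer applies.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print("two linear layers:", two_layers)
print("one folded layer: ", one_layer)
print("with ReLU between:", with_relu)
```

The first two results match to floating-point precision. Once a ReLU sits between the layers there is, in general, no single matrix that reproduces the result for every input.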

07

Backpropagation

The chain rule traces errors backward through every layer, computing each parameter's contribution to the mistake.

What you're seeing

Forward pass (green signals): data flows left→right through each layer, producing a prediction at the output.

Backward pass (red signals): error gradients flow right→left. Each weight learns: "how much did I contribute to the mistake?"

Chain rule: ∂Loss/∂w = ∂Loss/∂output × ∂output/∂w. Multiply local gradients layer by layer.

After backward: weight = weight − lr × gradient. Every weight adjusts proportionally. Repeat billions of times.
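The chain rule and the update step can be traced by hand for a single neuron. This is a sketch under assumptions: one sigmoid neuron, a squared-error loss, and made-up values for the weight, bias, input, and target. The numerical check at the end is only there to confirm the hand-derived gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Forward pass of one neuron: activation(weight * input + bias)."""
    return sigmoid(w * x + b)


def loss_fn(w, b, x, target):
    """Squared error between prediction and truth."""
    return (forward(w, b, x) - target) ** 2


# Hypothetical starting values.
w, b, x, target, lr = 0.5, 0.1, 2.0, 1.0, 0.1

# Forward pass.
out = forward(w, b, x)

# Backward pass: multiply local gradients, right to left.
dloss_dout = 2 * (out - target)   # how the loss changes with the output
dout_dz = out * (1 - out)         # slope of the sigmoid
dz_dw, dz_db = x, 1.0             # how the weighted sum changes with w and b
grad_w = dloss_dout * dout_dz * dz_dw
grad_b = dloss_dout * dout_dz * dz_db

# Sanity check: nudge w and measure the loss change directly.
h = 1e-6
numeric_grad_w = (loss_fn(w + h, b, x, target) - loss_fn(w - h, b, x, target)) / (2 * h)

# Update: weight = weight - lr * gradient.
new_w = w - lr * grad_w
new_b = b - lr * grad_b

print("chain-rule gradient:", grad_w)
print("numerical gradient: ", numeric_grad_w)
print("loss before:", loss_fn(w, b, x, target))
print("loss after: ", loss_fn(new_w, new_b, x, target))
```

A real network repeats the same multiplication of local gradients through every layer, which is what automatic differentiation libraries do for you.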

Click Run to watch the signal flow

1. → Forward: Data flows through layers → prediction
2. ✗ Loss: Compare prediction to truth. How wrong?
3. ← Backward: Chain rule sends gradients back
4. ↻ Update: Adjust every weight by its gradient
5. ∞ Repeat: Billions of times across data

08

Tokens & Embeddings

Text → sub-word pieces → high-dimensional number vectors. This is how AI reads.

Original text

The curious cat jumped over the lazy dog

✂️ tokenize

Tokens (sub-word pieces)

📊 embed into vectors

Embedding vectors (~1000+ numbers each)

What you're seeing

Tokenization: text is split into sub-word pieces. "understanding" → ["under", "stand", "ing"]. This lets models handle any word — even made-up ones.

Embedding: each token is mapped to a vector of ~1000+ numbers. Similar-meaning tokens end up close together in this "meaning space."

Position encoding: adds information about WHERE each token appears. "Dog bites man" ≠ "man bites dog" — word order matters!

Each token → unique ID → high-dimensional vector
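The token → ID → vector pipeline can be imitated with a toy example. Everything here is an assumption made for illustration: the tiny vocabulary, the greedy longest-match splitting rule (real tokenizers such as BPE learn their pieces from data), and the random 8-number embeddings standing in for learned vectors with 1000+ numbers.

```python
import numpy as np

# Hypothetical vocabulary: a few word pieces plus single letters as a fallback.
VOCAB = {"under", "stand", "ing", "un"} | set("abcdefghijklmnopqrstuvwxyz")


def tokenize(word, vocab):
    """Greedy longest-match split of a lowercase word into known pieces."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word[i]!r}")
    return tokens


# Each token gets a unique integer ID.
token_to_id = {tok: i for i, tok in enumerate(sorted(VOCAB))}

# Embedding table: one row of numbers per token (random here, learned in practice).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(token_to_id), 8))

tokens = tokenize("understanding", VOCAB)
ids = [token_to_id[t] for t in tokens]
vectors = embedding_table[ids]

print(tokens)                 # pieces of "understanding"
print(ids)                    # their integer IDs
print(vectors.shape)          # one vector per token
print(tokenize("blorping", VOCAB))  # a made-up word still splits into known pieces
```

Because single letters are in the vocabulary, any lowercase word can be split, which is the property that lets models handle words they have never seen.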

🌍 Context windows

Context windows vary by model and version: roughly 200K tokens for Claude, 128K for GPT-4 Turbo, and up to 1M for Gemini 1.5. One token ≈ ¾ of a word in English.

09

Attention

Instead of reading strictly one word at a time, attention lets every word look at every other word at once (during GPT-style generation, at every earlier word).

Hover over words to see attention patterns

What you're seeing

Hover any word. The arcs show how much attention that word pays to every other word. Thicker = more attention.

The highlighted word is the "source." Percentages show attention weights — they always sum to 100%.

Notice: "cat" attends strongly to "curious" (its adjective) and "jumped" (its verb). The model learns which relationships matter from data.

💡 How attention resolves "it"

In "The cat sat on the mat because it was tired" — what does "it" refer to? Attention lets the model look back and connect "it" to "cat" based on learned patterns. This is why transformers understand context so well.

10

Transformers

"Attention Is All You Need" (2017). The architecture behind Claude, GPT, and Gemini.

Input (E) → Query (Q), Key (K), Value (V) → Score (S) → Output (O)

Attention(Q,K,V) = softmax(QK^T / √d) × V

Query, Key, Value

Query (Q): "What am I looking for?" — each token generates a query describing what context it needs.

Key (K): "What do I contain?" — each token generates a key describing its information.

Value (V): "Here's my content" — the actual information to retrieve when key matches query.

Score: Q·K dot product measures similarity. softmax normalizes to probabilities. Multiply by V to get weighted information.
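The attention formula translates almost symbol for symbol into NumPy. This is a single-head sketch with no masking; the token count, vector size, and random projection matrices Wq, Wk, Wv are placeholders for values a real model would learn.

```python
import numpy as np


def softmax(x, axis=-1):
    """Row-wise softmax: each row becomes probabilities summing to 1."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V"""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # one row of attention weights per token
    return weights @ V, weights         # weighted mix of the values


rng = np.random.default_rng(0)
n_tokens, d = 4, 8                       # hypothetical sizes
X = rng.normal(size=(n_tokens, d))       # stand-in token embeddings

# Hypothetical projection matrices (learned during training in a real model).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

output, weights = attention(X @ Wq, X @ Wk, X @ Wv)

print("attention weights per token (rows sum to 1):")
print(weights.round(3))
print("output shape:", output.shape)
```

Each row of weights is what the hover demo draws as arcs: how much one token attends to every other token.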

📚 Library analogy

Q = your search query. K = labels on each book. V = the book content. Attention finds which books match your query, then gives you a weighted summary of their contents.

The full transformer pipeline

1. Tokenize: Split text into tokens
2. Embed: Map to vectors
3. Self-Attention: Q·K·V for each token
4. Feed-Forward: Process attended info
5. ×96 layers: Stack deep
6. Predict: Next token probability

Illustrative scale (exact figures for most frontier models are unpublished):

96 transformer layers · 128 attention heads per layer · ~1T tokens seen during training · ~1T learnable parameters

Transformer variants — not all models are the same

The standard transformer is the foundation, but teams have developed key variations to improve efficiency and capability.

🧱

Dense Transformer

input → ALL neurons → output

Every token passes through every parameter in every layer. Simple, powerful, but computationally expensive. All parameters are active for every input.

LLaMA · Gemma (the architectures of Claude and GPT-4o have not been published)

🔀

Mixture of Experts (MoE)

input → router → 2 of 16 experts

Has many "expert" sub-networks, but a router picks only a few experts per token (often 2 to 8). Total parameters are huge but compute per token is small. More efficient at scale.

Mixtral · Grok · DeepSeek-V3 · Kimi K2

🔄

State Space Models (SSM)

input → state → output

Process sequences through a compressed state instead of attending to all tokens. Much faster for very long sequences. Linear complexity vs quadratic for attention.

Mamba · Jamba · Zamba

Architecture	Total params	Active per token	Strength	Trade-off
Dense	All	All (100%)	Maximum quality per parameter	Expensive to run — every param used every time
MoE	Very large (e.g. 600B)	Small subset (e.g. 40B)	Fast inference despite huge total size	Harder to train, needs careful routing
SSM/Mamba	Moderate	All	Very fast on long sequences	May lose fine-grained attention detail
Hybrid (SSM + Attention)	Varies	Varies	Best of both — speed + precision	Complex architecture design

💡 MoE in plain English

Imagine a hospital with 16 specialist doctors but each patient only sees 2 of them. A router (triage nurse) decides which specialists each patient needs. The hospital has the expertise of 16 doctors but the cost of running 2. That's MoE — a model like Mixtral 8x7B has 47B total parameters but only activates ~13B per token. DeepSeek-V3 uses 671B total but activates only ~37B. The result: frontier-level quality at a fraction of the compute cost.
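The triage-nurse idea can be sketched as a top-k router. This is a simplified illustration, not the routing scheme of any specific model: each "expert" is reduced to a single random matrix (real experts are small feed-forward networks), and the sizes are made up to match the 2-of-16 picture.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def moe_layer(x, router_W, experts, top_k=2):
    """Route one token vector x to its top_k experts and mix their outputs."""
    logits = router_W @ x                    # one score per expert
    chosen = np.argsort(logits)[-top_k:]     # indices of the best-scoring experts
    gates = softmax(logits[chosen])          # mixing weights over the chosen few
    # Only the chosen experts do any work; the rest are skipped entirely.
    out = sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))
    return out, chosen, gates


rng = np.random.default_rng(0)
n_experts, d = 16, 8                               # hypothetical sizes
router_W = rng.normal(size=(n_experts, d))         # the "triage nurse"
experts = rng.normal(size=(n_experts, d, d))       # 16 "specialists"
x = rng.normal(size=d)                             # one token's vector

out, chosen, gates = moe_layer(x, router_W, experts, top_k=2)

print("experts consulted:", chosen)
print("mixing weights:   ", gates.round(3))
print(f"expert matrices used: {len(chosen)} of {n_experts}")
```

All 16 expert matrices count toward total parameters, but only 2 are multiplied for this token, which is where the compute savings come from.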

🔗 Go deeper — see it in real code

Now that you understand the concepts, see a real GPT built line by line: How a GPT Works — Line by Line walks through Karpathy's 200-line microgpt.py with animated diagrams, showing exactly how tokens, attention, and backpropagation work in actual Python code.

Also explore: Transformer Explainer (live GPT-2 in browser) · LLM Visualization (3D walkthrough).

11

Temperature & Sampling

How the model chooses its next word. Temperature, top-k, and top-p control creativity vs. predictability.

What you're seeing

The bars show the probability of each possible next word. The model has produced a distribution — now we must sample from it.

Temperature: controls randomness. Low (0.1) = very focused, picks the top word. High (2.0) = more random, creative, surprising.

Top-K: only consider the K most likely words. K=5 means ignore everything except the top 5 candidates.

Top-P (nucleus): keep words until cumulative probability reaches P. P=0.9 means keep the smallest set of words that covers 90% probability.

Adjust the sliders to see how sampling changes

Temperature 0.70

Top-K 10

Top-P 0.90

💡 When to use what

Code generation: low temperature (0.2), because you want precision. Creative writing: higher temperature (0.8–1.0). Brainstorming: high temperature (1.2+) on APIs that allow it. The Anthropic API accepts temperatures from 0 to 1 and defaults to 1.0.

How sampling works, step by step

1. 📊 Raw logits: The model outputs a raw score for every token in its vocabulary (on the order of 100K tokens)
2. 🌡️ Apply temperature: Divide all scores by temperature. Low temp → sharp peaks. High temp → flatter distribution.
3. 📈 Softmax: Convert scores to probabilities (each between 0 and 1, summing to 100%). Higher score → higher probability.
4. ✂️ Apply Top-K: Keep only the K most probable tokens. Discard the rest (probability → 0).
5. 🎯 Apply Top-P: Keep the smallest set of tokens whose probabilities sum to ≥ P. Further reduces candidates.
6. 🎲 Sample: Randomly pick one token from the remaining candidates, weighted by probability. This becomes the next token.
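The six steps can be written as one short function. This is a sketch under assumptions: the function name and defaults mirror the sliders above, probabilities are renormalized after the top-k cut before top-p is applied, and real inference libraries differ in such details.

```python
import numpy as np


def sample_next_token(logits, temperature=0.7, top_k=10, top_p=0.9, rng=None):
    """Pick the index of the next token from raw scores."""
    if rng is None:
        rng = np.random.default_rng()

    # Steps 1-2: raw scores, divided by temperature.
    scaled = np.asarray(logits, dtype=float) / temperature

    # Step 3: softmax turns scores into probabilities.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Step 4: keep only the top_k most probable tokens (most likely first).
    order = np.argsort(probs)[::-1][:top_k]
    kept = probs[order] / probs[order].sum()

    # Step 5: keep the smallest prefix whose cumulative probability reaches top_p.
    cumulative = np.cumsum(kept)
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    order, kept = order[:cutoff], kept[:cutoff]
    kept = kept / kept.sum()

    # Step 6: weighted random choice among the survivors.
    return int(rng.choice(order, p=kept))


# Hypothetical scores for a four-token vocabulary.
logits = [5.0, 1.0, 0.5, 0.1]
rng = np.random.default_rng(0)

focused = [sample_next_token(logits, temperature=0.1, rng=rng) for _ in range(5)]
creative = [sample_next_token(logits, temperature=2.0, top_p=1.0, rng=rng) for _ in range(5)]

print("temperature 0.1:", focused)   # the top token every time
print("temperature 2.0:", creative)  # other tokens can appear
```

At temperature 0.1 the top token holds almost all of the probability, so top-p keeps it alone and the choice is effectively deterministic.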

🔁 Then repeat

The chosen token is appended to the sequence. The entire process runs again to pick the next token. This continues until the model outputs a stop token or reaches the maximum length. Claude generates ~50–100 tokens per second this way.

12

How Claude Works

Claude is a Transformer built by Anthropic. What makes it special isn't the architecture — it's how it's trained.

What happens when you press Enter

Every concept you've learned, working together in a few seconds.

1. ⌨️ You type: "What is gravity?"
2. 🌐 API call: Your text travels to Anthropic's servers
3. ✂️ Tokenize: "What" "is" "grav" "ity" "?" → 5 token IDs
4. 📊 Embed: Each token → a vector of ~4000 numbers
5. 📍 Position: Add position info, because word order matters
6. 🔁 96 layers: Each runs attention → feed-forward → normalize
7. 🔦 Attention: 128 heads find relevant context per token
8. ⚖️ Weights: Billions of learned parameters transform the signal
9. 📈 Logits: Final layer outputs a score for every vocabulary word
10. 🌡️ Sample: Temperature + top-k + top-p → pick next token
11. 🔄 Repeat: Append token, run again. ~50-100 tokens/sec
12. 💬 Stream: Tokens stream back to your screen as text

⏱️ The numbers

For a typical response of 200 words (~270 tokens): 270 complete passes through all 96 transformer layers. Each pass involves billions of multiplications across attention heads and feed-forward networks. Total time: 2-5 seconds.
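The arithmetic behind those numbers is easy to reproduce. The sketch below uses only figures quoted on this page (¾ of a word per token, 50 to 100 tokens per second); the function name is made up, and the speeds are the page's rough estimates rather than measurements.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English, from the text


def estimate(words, tokens_per_second):
    """Return (token count, seconds to generate) for a response of given length."""
    tokens = round(words / WORDS_PER_TOKEN)
    return tokens, tokens / tokens_per_second


for rate in (50, 100):
    tokens, seconds = estimate(200, rate)
    print(f"{tokens} tokens at {rate} tokens/sec → about {seconds:.1f} s")
```

Each of those tokens requires its own pass through the full stack of layers, which is why generation time grows with response length.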

🤯 What makes it feel magical

The model has no memory between conversations — your entire chat is re-read from scratch each time. It doesn't "think" — it predicts the next token, over and over. Yet from this simple loop, reasoning, creativity, and understanding emerge.

The four phases of building and running Claude

Phase 1

Pre-training

Learn language from trillions of words by predicting the next token, over and over. Like reading a vast library of books, articles, and websites, and building a massive internal model of how language works.

Task: Given "The cat sat on the ___", predict "mat".
Scale: Trillions of tokens, thousands of GPUs, months of compute.
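The pre-training task comes with a simple score: how much probability did the model put on the token that actually came next? The sketch below uses cross-entropy loss, the standard choice for next-token prediction; the candidate words and probabilities are invented for illustration.

```python
import math


def next_token_loss(predicted_probs, correct_token):
    """Cross-entropy for one prediction: -log(probability of the right token)."""
    return -math.log(predicted_probs[correct_token])


# Hypothetical predictions for "The cat sat on the ___".
early_in_training = {"mat": 0.05, "moon": 0.60, "sofa": 0.35}
late_in_training = {"mat": 0.80, "moon": 0.02, "sofa": 0.18}

print("early loss:", next_token_loss(early_in_training, "mat"))
print("late loss: ", next_token_loss(late_in_training, "mat"))
```

Gradient descent pushes this number down across trillions of examples, which is what moves probability toward the right continuation.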

Phase 2

RLHF

Reinforcement Learning from Human Feedback. Human raters compare pairs of responses: "Response A is better than Response B." The model learns to prefer helpful, accurate, well-structured answers.

Process: Generate → Human rates → Reward model → Fine-tune.
Goal: Align model behavior with human preferences.

Phase 3

Constitutional AI

Anthropic's distinctive approach. Instead of relying only on human raters, Claude evaluates its own responses against a set of principles, like having an internal critic.

Process: Generate → Self-critique against principles → Revise → Improve.
Why it helps: Scales better than human feedback alone, because the model produces much of its own feedback.

Phase 4

Inference

When you chat with Claude: your prompt → tokens → through all 96+ transformer layers with attention → predict one token at a time → stream the response back to you.

Speed: ~50-100 tokens/second generated.
Each token: passes through every layer, attending to all previous tokens.

What is RLHF?

🤔 The problem RLHF solves

After pre-training, the model can generate fluent text — but it might be unhelpful, offensive, or confidently wrong. It learned to predict text, not to be useful. RLHF (Reinforcement Learning from Human Feedback) teaches the model what humans actually want: helpful, honest, harmless responses.

📋 How RLHF works, step by step

1. The model generates multiple responses to the same prompt.
2. Human raters rank them: "Response A > Response B > Response C."
3. A reward model is trained on these rankings — it learns to score responses the way humans would.
4. The main model is fine-tuned using reinforcement learning to maximize the reward model's score — while staying close to its original behavior.
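Step 3 is usually trained with a pairwise loss: the reward model should score the response humans preferred above the one they rejected. The sketch below shows that loss in its common form (a Bradley-Terry style objective, as used in published RLHF work); the reward numbers are invented, and this is not a description of Anthropic's exact training setup.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)).

    Small when the preferred response scores well above the rejected one,
    large when the reward model ranks them the wrong way round."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Hypothetical reward-model scores for "Response A > Response B".
print("agrees with humans:   ", preference_loss(2.0, 0.0))
print("undecided:            ", preference_loss(0.0, 0.0))
print("disagrees with humans:", preference_loss(0.0, 2.0))
```

Once the reward model is trained this way, step 4 uses its scores as the reward signal, typically with a penalty that keeps the fine-tuned model close to its original behavior.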

💡 Why it matters

Without RLHF, asking "How do I make a cake?" might get a Wikipedia-style essay about the history of baking. With RLHF, you get a clear recipe. The model's knowledge doesn't change — RLHF changes how it presents that knowledge. It's the difference between a knowledgeable professor who lectures at you and one who actually answers your question. Claude uses RLHF plus Constitutional AI (self-critique against principles) — which scales better because it doesn't need human raters for every improvement.

Inside the transformer, component by component

Component	What it does	Analogy
Tokenizer	Splits text into sub-word pieces	Breaking words into syllables
Embedding	Maps each token to a ~4000-dimensional vector	GPS coordinates in meaning-space
Multi-Head Attention	128 attention heads look at context simultaneously	128 readers each highlighting different relationships
Feed-Forward Network	Processes the attended information through 2 dense layers	A "thinking" step — transform what was noticed
Layer Normalization	Stabilizes signal magnitude between layers	Keeping the volume level consistent
Residual Connection	Adds the input back to the output of each sub-layer	A shortcut — preserving the original signal
Softmax Output	Final probability distribution over ~100K vocabulary tokens	Ranking all possible next words by likelihood

Tokens, parameters, and weights — how they connect

🔤 Tokens

Tokens are the input and output. Sub-word pieces the model reads and writes. "Hello, how are you?" becomes ~6 tokens. The model generates tokens one at a time. Tokens flow through the model — they're the data, not the model itself.

⚖️ Weights

Weights are the learnable numbers inside the model. Every connection between neurons has a weight. Every attention head has weight matrices (Q, K, V). These numbers were learned during training — they are the model's knowledge. When we say a model "knows" something, it's encoded in these weights.

🎛️ Parameters

Parameters = weights + biases. The total count of all learnable numbers. Every weight is a parameter. Every bias is a parameter. "Claude has billions of parameters" means billions of numbers adjusted during training.

🔗 How they connect

Tokens flow through the network. At each layer, they're multiplied by weights (transformed). Total weights + biases = parameters. More parameters = more capacity to learn. Training adjusts all parameters to minimize loss. Inference passes tokens through fixed parameters.
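"Parameters = weights + biases" can be counted directly for a small fully connected network. The layer sizes below are made up, and the helper name count_parameters is hypothetical; transformers add attention and embedding matrices on top of this, but the counting principle is the same.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of neurons per layer, inputs first."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out  # one weight per connection
        biases = n_out          # one bias per neuron in the receiving layer
        total += weights + biases
    return total


# A tiny network: 3 inputs → 4 hidden neurons → 2 outputs.
print(count_parameters([3, 4, 2]))        # (3*4 + 4) + (4*2 + 2)

# A larger hypothetical one.
print(count_parameters([784, 512, 512, 10]))
```

Every one of these numbers is adjusted during training and then held fixed during inference.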

What does "7B model" mean?

Model size	Parameters	What it means	Analogy
7B	7 billion	A capable model that runs on a single GPU. Popular for local/open-source use.	A detailed city map
13B	13 billion	Nearly 2× capacity. Better at nuance and complex reasoning.	A regional atlas
70B	70 billion	Needs multiple GPUs. Significantly more capable at complex tasks.	A national encyclopedia
~400B+	Hundreds of billions	Frontier-scale models (exact sizes of Claude and GPT-4 are unpublished). Massive compute. State-of-the-art.	The internet's knowledge, compressed

💡 Size isn't everything

A 7B model = 7 billion weights, each typically 2 bytes → ~14 GB of raw data. But parameter count alone doesn't determine quality — training data quality, training method (RLHF, Constitutional AI), and architecture choices matter enormously. A well-trained 7B can outperform a poorly trained 13B. This is why Anthropic invests in training methodology, not just scale.
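The 14 GB figure is just multiplication, and the same sum works for other sizes and precisions. A quick sketch, assuming 1 GB = 10⁹ bytes and counting only the weights themselves (running a model also needs memory for activations and caches, which this ignores):

```python
def model_memory_gb(n_params, bytes_per_param=2):
    """Raw storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    half_precision = model_memory_gb(n_params, bytes_per_param=2)
    four_bit = model_memory_gb(n_params, bytes_per_param=0.5)
    print(f"{name}: {half_precision:.0f} GB at 2 bytes/param, "
          f"{four_bit:.1f} GB at 4 bits/param")
```

This is why a 7B model fits on a single consumer GPU while a 70B model at the same precision needs several.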
